Install Required Libraries

Step 1 is always to make sure the packages needed are installed. The code below uses several packages that you will need to install in order to follow along:

  • igraph
  • network
  • sna

Other packages will be mentioned along the way, as necessary.

#install.packages("igraph") 
#install.packages("network") 
#install.packages("sna")
#install.packages("ggraph")
#install.packages("visNetwork")
#install.packages("threejs")
#install.packages("networkD3")
#install.packages("ndtv")

Brief introduction to the color palette

Colors are important for SNA and for map making. Different strategies apply to each type of visual analysis.

#Colors in R plots
#install.packages('RColorBrewer')
library('RColorBrewer')
display.brewer.all()

display.brewer.pal(8, "Set3")

display.brewer.pal(8, "Spectral")

display.brewer.pal(8, "Blues")

Network layouts

Network layouts are simply algorithms that return coordinates for each node in a network. Each layout uses a different algorithm to arrange the nodes.
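To see this concretely, here is a minimal sketch on a small built-in graph (a 10-node ring): the layout function returns an n-by-2 matrix of (x, y) coordinates that plot() can consume.

```r
library(igraph)

# A layout is just a matrix of coordinates: one row per node,
# two columns (x and y).
g <- make_ring(10)
coords <- layout_with_fr(g)   # Fruchterman-Reingold
dim(coords)                   # 10 rows (nodes), 2 columns (x, y)

# The coordinate matrix can be handed straight to plot():
plot(g, layout = coords)
```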

Below you will find a list of possible layouts we can use with igraph. We are going to stick with the most common layout, called Fruchterman-Reingold, but depending on the results/research question, you might want to explore alternative layouts, as they may fit the analysis better. Good guides to layouts using the sna package are also available online.

## [1] "layout_as_star"
## [1] "layout_components"
## [1] "layout_in_circle"
## [1] "layout_nicely"
## [1] "layout_on_grid"
## [1] "layout_on_sphere"
## [1] "layout_randomly"
## [1] "layout_with_dh"
## [1] "layout_with_drl"
## [1] "layout_with_fr"
## [1] "layout_with_gem"
## [1] "layout_with_graphopt"
## [1] "layout_with_kk"
## [1] "layout_with_lgl"
## [1] "layout_with_mds"

Data format, size, and preparation

To begin, we will work with a data set containing information about (1) UCCS employees and (2) their connections with students. The first is a network of contacts – the number and type of contacts (in person and virtual) – between different categories of UCCS employees. The second is a network of links between the employees and UCCS students.

The ideas behind the visualizations will apply to any dataset. However, certain visual properties, such as node attributes, are impossible to distinguish in larger graphs. In fact, when drawing very big networks you may want to hide the edges and focus on identifying and visualizing communities of nodes, conduct multimodal analyses (like the facets we saw before using ggplot), or provide charts that show key characteristics of the network graph.

Note: there is a component of SNA that uses the APIs of various websites/apps (e.g. Twitter).

Examining Dataset 1: The edgelist

The first data set we are going to work with consists of two files, “Dataset1-UCCS-Example-NODES.csv” and “Dataset1-UCCS-Example-EDGES.csv”

Load the data below. The data are stored in the “nodes” and “links” objects (note: these names are used for clarity; you can name the objects anything).

nodes <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Dataset1-UCCS-NODES.csv", header=T, as.is=T)
links <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Dataset1-UCCS-Example-EDGES.csv", header=T, as.is=T)

Examine the data below and make sure it loaded properly into RStudio. Click on the nodes dataset and the links dataset and examine the structure.

head(nodes)
##    id      name level.type type.label years.worked
## 1 s01       Jon          1  Professor           10
## 2 s02      Anna          1  Professor            5
## 3 s03 Henriikka          1  Professor            7
## 4 s04       Gia          1  Professor            9
## 5 s05      Rich          1  Professor           20
## 6 s06      Katy          1  Professor           25
head(links)
##   from  to    type weight
## 1  s01 s02 virtual     22
## 2  s01 s03 virtual     22
## 3  s01 s04 virtual     21
## 4  s01 s15 virtual     20
## 5  s02 s01 virtual     23
## 6  s02 s03 virtual     21

Creating an igraph object

Next we will convert the raw data to an igraph network object – that is, a format that igraph and R can work with.

To do that, we will use the graph_from_data_frame() function, which takes two data frames as arguments: d and vertices.

  • d describes the edges (also called links or ties) of the network. Its first two columns are the IDs of the “source” and the “target” node for each edge. The columns that follow are edge attributes (weight, type, label, or anything else).
  • vertices starts with a column of node IDs. Any following columns are interpreted as node attributes.
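Before loading the real data into this function, here is a minimal, self-contained sketch of the call using tiny made-up data frames (toy_links and toy_nodes are invented for illustration):

```r
library(igraph)

# d: first two columns are source/target IDs; extra columns become edge attributes
toy_links <- data.frame(from   = c("A", "A", "B"),
                        to     = c("B", "C", "C"),
                        weight = c(2, 1, 3))

# vertices: first column is the node ID; extra columns become node attributes
toy_nodes <- data.frame(id   = c("A", "B", "C"),
                        role = c("x", "y", "y"))

toy <- graph_from_data_frame(d = toy_links, vertices = toy_nodes, directed = TRUE)
vcount(toy)  # 3 nodes
ecount(toy)  # 3 edges
```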

First, make sure to install the igraph library, then invoke the library by typing:

library("igraph")

To create the graph object type:

net <- graph_from_data_frame(d=links, vertices=nodes, directed=T) 
net
## IGRAPH a0c3d0e DNW- 17 49 -- 
## + attr: name (v/c), level.type (v/n), type.label (v/c),
## | years.worked (v/n), type (e/c), weight (e/n)
## + edges from a0c3d0e (vertex names):
##  [1] Jon      ->Anna      Jon      ->Henriikka Jon      ->Gia      
##  [4] Jon      ->Reeses    Anna     ->Jon       Anna     ->Henriikka
##  [7] Anna     ->Oji       Anna     ->Lori      Henriikka->Jon      
## [10] Henriikka->Gia       Henriikka->Rich      Henriikka->Sunni    
## [13] Henriikka->Lori      Henriikka->Maya      Henriikka->Nonee    
## [16] Gia      ->Henriikka Gia      ->Katy      Gia      ->Maya     
## [19] Gia      ->Nonee     Gia      ->Susi      Rich     ->Jon      
## + ... omitted several edges

The description of an igraph object starts with four letters:

  • D or U, for a directed or undirected graph
  • N for a named graph (where nodes have a name attribute)
  • W for a weighted graph (where edges have a weight attribute)
  • B for a bipartite (two-mode) graph (where nodes have a type attribute)

The two numbers that follow (17, 49) refer to the number of nodes and edges in the graph. The description also lists node & edge attributes, for example:

  • (g/c) - graph-level character attribute
  • (v/c) - vertex-level character attribute
  • (e/n) - edge-level numeric attribute

Let’s describe the type of graph we have here.

We can also access nodes, edges, and their attributes by typing:

E(net)       # The edges of the "net" object
## + 49/49 edges from a0c3d0e (vertex names):
##  [1] Jon      ->Anna      Jon      ->Henriikka Jon      ->Gia      
##  [4] Jon      ->Reeses    Anna     ->Jon       Anna     ->Henriikka
##  [7] Anna     ->Oji       Anna     ->Lori      Henriikka->Jon      
## [10] Henriikka->Gia       Henriikka->Rich      Henriikka->Sunni    
## [13] Henriikka->Lori      Henriikka->Maya      Henriikka->Nonee    
## [16] Gia      ->Henriikka Gia      ->Katy      Gia      ->Maya     
## [19] Gia      ->Nonee     Gia      ->Susi      Rich     ->Jon      
## [22] Rich     ->Anna      Rich     ->Oji       Rich     ->Reeses   
## [25] Katy     ->Katy      Katy     ->Oreo      Katy     ->Susi     
## [28] Kate     ->Henriikka Kate     ->Sunni     Kate     ->Lori     
## + ... omitted several edges
V(net)       # The vertices of the "net" object
## + 17/17 vertices, named, from a0c3d0e:
##  [1] Jon       Anna      Henriikka Gia       Rich      Katy      Kate     
##  [8] Sunni     Oji       Lori      Maya      Nonee     Gigi      Rod      
## [15] Reeses    Oreo      Susi
E(net)$type  # Edge attribute "type"
##  [1] "virtual" "virtual" "virtual" "virtual" "virtual" "virtual" "virtual"
##  [8] "virtual" "virtual" "virtual" "virtual" "virtual" "person"  "virtual"
## [15] "virtual" "virtual" "person"  "person"  "virtual" "person"  "person" 
## [22] "virtual" "virtual" "person"  "virtual" "virtual" "person"  "person" 
## [29] "person"  "virtual" "person"  "virtual" "person"  "person"  "person" 
## [36] "virtual" "person"  "virtual" "person"  "virtual" "person"  "person" 
## [43] "person"  "virtual" "virtual" "virtual" "virtual" "person"  "virtual"
V(net)$name  # Vertex attribute "name"
##  [1] "Jon"       "Anna"      "Henriikka" "Gia"       "Rich"     
##  [6] "Katy"      "Kate"      "Sunni"     "Oji"       "Lori"     
## [11] "Maya"      "Nonee"     "Gigi"      "Rod"       "Reeses"   
## [16] "Oreo"      "Susi"
# Find nodes and edges by attribute:
# (that returns objects of type vertex sequence/edge sequence)
V(net)[name=="Anna"]
## + 1/17 vertex, named, from a0c3d0e:
## [1] Anna
E(net)[type=="virtual"]
## + 30/49 edges from a0c3d0e (vertex names):
##  [1] Jon      ->Anna      Jon      ->Henriikka Jon      ->Gia      
##  [4] Jon      ->Reeses    Anna     ->Jon       Anna     ->Henriikka
##  [7] Anna     ->Oji       Anna     ->Lori      Henriikka->Jon      
## [10] Henriikka->Gia       Henriikka->Rich      Henriikka->Sunni    
## [13] Henriikka->Maya      Henriikka->Nonee     Gia      ->Henriikka
## [16] Gia      ->Nonee     Rich     ->Anna      Rich     ->Oji      
## [19] Katy     ->Katy      Katy     ->Oreo      Kate     ->Lori     
## [22] Sunni    ->Henriikka Lori     ->Henriikka Nonee    ->Gigi     
## [25] Gigi     ->Nonee     Reeses   ->Jon       Reeses   ->Gia      
## [28] Reeses   ->Katy      Oreo     ->Katy      Susi     ->Gia
# You can also examine the network matrix directly:
net[,]
## 17 x 17 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 17 column names 'Jon', 'Anna', 'Henriikka' ... ]]
##                                                            
## Jon        . 22 22 21 .  .  .  .  .  .  .  .  .  . 20  .  .
## Anna      23  . 21  . .  .  .  .  1  5  .  .  .  .  .  .  .
## Henriikka 21  .  . 22 1  .  .  4  .  2  1  1  .  .  .  .  .
## Gia        .  . 23  . .  1  .  .  .  . 22  3  .  .  .  .  2
## Rich       1 21  .  . .  .  .  .  2  .  .  .  .  . 21  .  .
## Katy       .  .  .  . .  1  .  .  .  .  .  .  .  .  . 21 21
## Kate       .  .  1  . .  .  . 22  . 21  .  .  .  4  .  .  .
## Sunni      .  .  2  . .  . 21  . 23  .  .  .  .  .  .  .  .
## Oji        .  .  .  . .  .  .  .  . 21  .  .  .  .  .  .  .
## Lori       .  .  2  . .  .  .  .  .  .  .  .  .  .  .  .  .
## Maya       .  .  .  . .  .  .  .  .  .  .  .  .  .  .  .  .
## Nonee      .  .  .  . .  2  .  .  .  .  .  . 22 22  .  .  .
## Gigi       .  .  .  . .  .  .  .  .  .  . 21  .  .  .  .  1
## Rod        .  .  .  . .  .  .  .  .  .  1  . 21  .  .  .  .
## Reeses    22  .  .  1 .  4  .  .  .  .  .  .  .  .  .  .  .
## Oreo       .  .  .  . . 23  .  .  .  .  .  .  .  .  .  . 21
## Susi       .  .  .  4 .  .  .  .  .  .  .  .  .  .  .  .  .
net[1,]
##       Jon      Anna Henriikka       Gia      Rich      Katy      Kate 
##         0        22        22        21         0         0         0 
##     Sunni       Oji      Lori      Maya     Nonee      Gigi       Rod 
##         0         0         0         0         0         0         0 
##    Reeses      Oreo      Susi 
##        20         0         0
net[5,7]
## [1] 0

Question: What does the code E(net)[type=="virtual"] tell us?

Question: How would we access the cell in row 6 column 9?

Question: What about row 40, column 3?

It is also easy to extract an edge list or matrix back from the igraph network:

# Get an edge list or a matrix:
as_edgelist(net, names=T)
##       [,1]        [,2]       
##  [1,] "Jon"       "Anna"     
##  [2,] "Jon"       "Henriikka"
##  [3,] "Jon"       "Gia"      
##  [4,] "Jon"       "Reeses"   
##  [5,] "Anna"      "Jon"      
##  [6,] "Anna"      "Henriikka"
##  [7,] "Anna"      "Oji"      
##  [8,] "Anna"      "Lori"     
##  [9,] "Henriikka" "Jon"      
## [10,] "Henriikka" "Gia"      
## [11,] "Henriikka" "Rich"     
## [12,] "Henriikka" "Sunni"    
## [13,] "Henriikka" "Lori"     
## [14,] "Henriikka" "Maya"     
## [15,] "Henriikka" "Nonee"    
## [16,] "Gia"       "Henriikka"
## [17,] "Gia"       "Katy"     
## [18,] "Gia"       "Maya"     
## [19,] "Gia"       "Nonee"    
## [20,] "Gia"       "Susi"     
## [21,] "Rich"      "Jon"      
## [22,] "Rich"      "Anna"     
## [23,] "Rich"      "Oji"      
## [24,] "Rich"      "Reeses"   
## [25,] "Katy"      "Katy"     
## [26,] "Katy"      "Oreo"     
## [27,] "Katy"      "Susi"     
## [28,] "Kate"      "Henriikka"
## [29,] "Kate"      "Sunni"    
## [30,] "Kate"      "Lori"     
## [31,] "Kate"      "Rod"      
## [32,] "Sunni"     "Henriikka"
## [33,] "Sunni"     "Kate"     
## [34,] "Sunni"     "Oji"      
## [35,] "Oji"       "Lori"     
## [36,] "Lori"      "Henriikka"
## [37,] "Nonee"     "Katy"     
## [38,] "Nonee"     "Gigi"     
## [39,] "Nonee"     "Rod"      
## [40,] "Gigi"      "Nonee"    
## [41,] "Gigi"      "Susi"     
## [42,] "Rod"       "Maya"     
## [43,] "Rod"       "Gigi"     
## [44,] "Reeses"    "Jon"      
## [45,] "Reeses"    "Gia"      
## [46,] "Reeses"    "Katy"     
## [47,] "Oreo"      "Katy"     
## [48,] "Oreo"      "Susi"     
## [49,] "Susi"      "Gia"
as_adjacency_matrix(net, attr="weight")
## 17 x 17 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 17 column names 'Jon', 'Anna', 'Henriikka' ... ]]
##                                                            
## Jon        . 22 22 21 .  .  .  .  .  .  .  .  .  . 20  .  .
## Anna      23  . 21  . .  .  .  .  1  5  .  .  .  .  .  .  .
## Henriikka 21  .  . 22 1  .  .  4  .  2  1  1  .  .  .  .  .
## Gia        .  . 23  . .  1  .  .  .  . 22  3  .  .  .  .  2
## Rich       1 21  .  . .  .  .  .  2  .  .  .  .  . 21  .  .
## Katy       .  .  .  . .  1  .  .  .  .  .  .  .  .  . 21 21
## Kate       .  .  1  . .  .  . 22  . 21  .  .  .  4  .  .  .
## Sunni      .  .  2  . .  . 21  . 23  .  .  .  .  .  .  .  .
## Oji        .  .  .  . .  .  .  .  . 21  .  .  .  .  .  .  .
## Lori       .  .  2  . .  .  .  .  .  .  .  .  .  .  .  .  .
## Maya       .  .  .  . .  .  .  .  .  .  .  .  .  .  .  .  .
## Nonee      .  .  .  . .  2  .  .  .  .  .  . 22 22  .  .  .
## Gigi       .  .  .  . .  .  .  .  .  .  . 21  .  .  .  .  1
## Rod        .  .  .  . .  .  .  .  .  .  1  . 21  .  .  .  .
## Reeses    22  .  .  1 .  4  .  .  .  .  .  .  .  .  .  .  .
## Oreo       .  .  .  . . 23  .  .  .  .  .  .  .  .  .  . 21
## Susi       .  .  .  4 .  .  .  .  .  .  .  .  .  .  .  .  .
# Or data frames describing nodes and edges:
as_data_frame(net, what="edges")
##         from        to    type weight
## 1        Jon      Anna virtual     22
## 2        Jon Henriikka virtual     22
## 3        Jon       Gia virtual     21
## 4        Jon    Reeses virtual     20
## 5       Anna       Jon virtual     23
## 6       Anna Henriikka virtual     21
## 7       Anna       Oji virtual      1
## 8       Anna      Lori virtual      5
## 9  Henriikka       Jon virtual     21
## 10 Henriikka       Gia virtual     22
## 11 Henriikka      Rich virtual      1
## 12 Henriikka     Sunni virtual      4
## 13 Henriikka      Lori  person      2
## 14 Henriikka      Maya virtual      1
## 15 Henriikka     Nonee virtual      1
## 16       Gia Henriikka virtual     23
## 17       Gia      Katy  person      1
## 18       Gia      Maya  person     22
## 19       Gia     Nonee virtual      3
## 20       Gia      Susi  person      2
## 21      Rich       Jon  person      1
## 22      Rich      Anna virtual     21
## 23      Rich       Oji virtual      2
## 24      Rich    Reeses  person     21
## 25      Katy      Katy virtual      1
## 26      Katy      Oreo virtual     21
## 27      Katy      Susi  person     21
## 28      Kate Henriikka  person      1
## 29      Kate     Sunni  person     22
## 30      Kate      Lori virtual     21
## 31      Kate       Rod  person      4
## 32     Sunni Henriikka virtual      2
## 33     Sunni      Kate  person     21
## 34     Sunni       Oji  person     23
## 35       Oji      Lori  person     21
## 36      Lori Henriikka virtual      2
## 37     Nonee      Katy  person      2
## 38     Nonee      Gigi virtual     22
## 39     Nonee       Rod  person     22
## 40      Gigi     Nonee virtual     21
## 41      Gigi      Susi  person      1
## 42       Rod      Maya  person      1
## 43       Rod      Gigi  person     21
## 44    Reeses       Jon virtual     22
## 45    Reeses       Gia virtual      1
## 46    Reeses      Katy virtual      4
## 47      Oreo      Katy virtual     23
## 48      Oreo      Susi  person     21
## 49      Susi       Gia virtual      4
as_data_frame(net, what="vertices")
##                name level.type type.label years.worked
## Jon             Jon          1  Professor           10
## Anna           Anna          1  Professor            5
## Henriikka Henriikka          1  Professor            7
## Gia             Gia          1  Professor            9
## Rich           Rich          1  Professor           20
## Katy           Katy          1  Professor           25
## Kate           Kate          2 Instructor           40
## Sunni         Sunni          2 Instructor            2
## Oji             Oji          2 Instructor            4
## Lori           Lori          2 Instructor            5
## Maya           Maya          2 Instructor           23
## Nonee         Nonee          3      Staff           15
## Gigi           Gigi          3      Staff           64
## Rod             Rod          3      Staff           34
## Reeses       Reeses          3      Staff           23
## Oreo           Oreo          3      Staff           77
## Susi           Susi          3      Staff            3
plot(net) # most basic plot of the network

Question: What types of structures/properties are evident from this graph?

Notice that Katy has “nominated” herself, meaning in this instance that she communicates or collaborates with herself. This information may or may not be useful given the research question. Either way, we are going to remove duplicates and self-nominations as follows:

net <- simplify(net, remove.multiple = F, remove.loops = T) 

Let’s also reduce the arrow size and remove the labels (we do that by setting them to NA):

plot(net, edge.arrow.size=.4,vertex.label=NA)

Plotting networks with igraph

You may have noticed that you can tweak the parameters for clarity, consistency, etc.

Exercise: Let’s change the edges of the above network graph by making them blue.

Below, I plot the graph using curved edges (edge.curved=.1) and reduce arrow size a bit.

plot(net, edge.arrow.size=.25, edge.curved=.1)

Note that using curved edges will allow you to see multiple links between two nodes (e.g. links going in either direction, or multiplex links).

Let’s change the edge color to red, the node & border color to orange and replace the vertex label with the node names stored in variable “name” in the nodes data set.

Notice you can use either the hex code or the color name.

plot(net, edge.arrow.size=.2, edge.color="red",
     vertex.color="orange", vertex.frame.color="#ffffff",
     vertex.label=V(net)$name, vertex.label.color="navy") 

The second way to set attributes is to add them to the igraph object.

Let’s color our network nodes based on type of professor, and size them based on the number of years worked (more years worked -> larger node).

We will also change the width of the edges based on the variable “weight.”

In this case, I know there are three types of people in my network, I have professors, instructors and staff.

To recap, in this plot:

  • size of the node (vertex) represents the scaled value of years worked for that person
  • the color of the node represents the type of employee
  • each node is labeled with the attribute “name”
  • the width of the edge represents the variable “weight” (number of contacts)

Who’s Zooming Who?

# Generate colors based on professor type:
colrs <- c("green", "yellow", "red")

#This assigns green to professor, yellow to instructor and red to staff
V(net)$color <- colrs[V(net)$level.type] #Note: this variable must be numeric

#If you just wanted to visualize professors versus everyone else use:
#V(net)$color <- ifelse(nodes[V(net), 4] == "Professor", "blue", "red")

# We could also use the years worked value:
V(net)$size <- (V(net)$years.worked*10)/18

# Use the node names (stored in the "name" attribute) as labels.
# (Setting vertex.label to NA instead would render no labels.)
V(net)$label <- V(net)$name

# Set edge width based on weight:
E(net)$width <- E(net)$weight/7

#change arrow size and edge color:
E(net)$arrow.size <- .2
E(net)$color <- "gray80"

# We can even set the network layout:
#graph_attr(net, "layout") <- layout_with_lgl
graph_attr(net, "layout") <- layout_with_fr #keeps related vertices "close"
plot(net) 

plot(net) 
legend(x=-1.5, y=-1.1, c("Professor","Instructor", "Staff"), pch=21,
       col="#777777", pt.bg=colrs, pt.cex=2, cex=.8, bty="n", ncol=3)

Question: What do we learn about UCCS employees from this graph?

  • Instructors and professors seem to stick with their own kind (clusters, colors)
  • Henriikka has more in-group ties than anyone else (size of node)
  • The groups seem to communicate more within themselves than with other groups (width of lines)
  • From a structural perspective, if we wanted to ensure that professors found out about a change in course scheduling, who would you target?
plot(net, edge.color="orange", vertex.color="gray50") 

plot(net, vertex.shape="none", vertex.label=V(net)$name, 
     vertex.label.font=2, vertex.label.color="gray40",
     vertex.label.cex=.7, edge.color="gray85")

Network measures

Descriptive statistics

hist(links$weight)

mean(links$weight)
## [1] 12.40816
sd(links$weight)
## [1] 9.905635

Network Density

Density is defined as the number of ties present divided by the total number of possible ties.

It addresses the question: How connected is this network? (Density tends to be higher where the costs of maintaining a tie are low, ~> .2.)

Most social networks are not uniformly connected; rather, they contain pockets of high density where clusters can be identified.

Here, a higher-density network is a more connected network.

graph.density(net, loops = FALSE)
## [1] 0.1764706
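We can hand-check the formula on a toy graph: a directed graph with 3 nodes has 3 × 2 = 6 possible ties, so 2 ties give a density of 2/6 ≈ 0.33. A quick sketch:

```r
library(igraph)

# 3 nodes, 2 directed ties out of 6 possible
g <- graph_from_literal(A --+ B, B --+ C)
edge_density(g)                             # 2/6 = 0.3333333
ecount(g) / (vcount(g) * (vcount(g) - 1))   # same value, computed by hand
```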

Average path length

The average path length is the average number of steps in the shortest paths necessary to navigate the network.

Average path length gives us a sense of how efficiently information flows through the network – it is correlated with density (less dense = more hops, on average, to reach any one person/node).

To calculate the average number of hops to get to any node:

mean_distance(net)
## [1] 2.742188

Intuitively, this means I can reach anyone in the network in an average of 2.74 steps.

Note: don’t look at any one measure, but rather look at the measures collectively to tell the story. For example, we may think that decreasing the mean path length would provide a more efficient information flow, but to gauge efficiency we would want to compare the average path length with the network density coefficient. This is because if the density remains constant but the average path length decreases, the network has a very small number of highly connected nodes. It doesn’t necessarily mean connections are happening in other parts of the network.
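As a sanity check on a toy graph: in an undirected 3-node path A - B - C, the shortest paths have lengths 1 (A-B), 1 (B-C), and 2 (A-C), so the mean distance is 4/3 ≈ 1.33:

```r
library(igraph)

g <- graph_from_literal(A - B - C)  # a simple 3-node path
mean_distance(g)                    # (1 + 1 + 2) / 3 = 1.333333
```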

Transitivity

Assume that I am friends with Anna, and Anna has a third friend.

If a network is highly transitive, then since I am friends with Anna, I am likely also friends with that third person.

Transitivity can give an idea about potential underlying mechanisms in the network. Gangs, for example, have high transitivity. There may be a social mechanism enforcing the transitivity we observe – one question may be whether the clusters have a unique impact on an outcome variable.

Closely related to transitivity is clustering. The clustering coefficient is defined as the proportion of closed triads over all triads in the network (open and closed).

Clustering measures the extent to which a group of nodes are more connected to each other than to others.

A highly clustered network may be indicative of bottlenecks, where “stuff” is kept among cluster members or away from other clusters. Clusters can also indicate “group think.”

Community detection methods can identify clusters in a network.

transitivity(net)
## [1] 0.372549
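To see where such a number comes from, consider a toy graph: a triangle A-B-C with a pendant node D attached to C. It contains 1 triangle (3 closed triples) out of 5 connected triples, so global transitivity is 3/5 = 0.6:

```r
library(igraph)

# Triangle A-B-C plus a pendant edge C-D
g <- graph_from_literal(A - B - C - A, C - D)
transitivity(g)  # 3 closed triples / 5 connected triples = 0.6
```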

Community detection (to identify clustering) is used to mark densely connected nodes. Within the group there are dense connections and between the groups there are sparser connections.

# Community detection
cnet <- cluster_edge_betweenness(net)
## Warning in cluster_edge_betweenness(net): At community.c:460 :Membership
## vector will be selected based on the lowest modularity score.
## Warning in cluster_edge_betweenness(net): At community.c:467 :Modularity
## calculation with weighted edge betweenness community detection might not
## make sense -- modularity treats edge weights as similarities while edge
## betwenness treats them as distances
# Community detection returns an object of class "communities" which can be plotted: 
class(cnet)
## [1] "communities"
plot(cnet,
     net,
     vertex.size = 10,
     vertex.label.cex = 0.8)

You can look at the actual values being clustered by typing this:

edge_betweenness(net, directed=T, weights=NA)
##  [1] 10.833333 11.333333  8.333333  9.500000  4.000000 12.500000  3.000000
##  [8]  2.333333 24.000000 16.000000 31.500000 32.500000  9.500000  6.500000
## [15] 23.000000 65.333333 11.000000  6.500000 18.000000  8.666667  5.333333
## [22] 10.000000  6.000000 11.166667 15.000000 21.333333 10.000000  2.000000
## [29]  1.333333  4.500000 11.833333 16.833333  6.833333 16.833333 31.000000
## [36] 17.000000 18.000000 14.500000  7.500000 28.500000  3.000000 17.000000
## [43]  5.666667  9.666667  6.333333  1.000000 15.000000 74.500000

In any given case, another visualization may work better. Below we create a hierarchical dendrogram of the scores to show the relationships in a different way:

dendPlot(cnet, mode="hclust")

Each “box” represents the individuals in a densely connected group. (Sorry, Anna!)

#ratio of number of edges to number of possibilities
edge_density(net, loops = F)
## [1] 0.1764706
ecount(net)/(vcount(net)*(vcount(net)-1))
## [1] 0.1764706

Positional features tell us who the important players in the network are.

Degree distribution

A node’s degree centrality refers to the number of ties it has – that is, the number of connections for that node.

The metrics associated with “degree” address the question: Are highly (or minimally) connected people unique in some way, or ‘at risk’ in some way?

There are two types of degree:

  • A node’s in-degree centrality refers to the number of links it receives
  • A node’s out-degree centrality refers to the number of links it sends out
degree(net, mode='all') #number of connections (note: in + out = all)
##       Jon      Anna Henriikka       Gia      Rich      Katy      Kate 
##         8         6        13         9         5         6         5 
##     Sunni       Oji      Lori      Maya     Nonee      Gigi       Rod 
##         5         4         5         3         6         4         4 
##    Reeses      Oreo      Susi 
##         5         3         5
degree(net, mode='in') #how many arrows are coming into a vertex (e.g. Jon)
##       Jon      Anna Henriikka       Gia      Rich      Katy      Kate 
##         4         2         6         4         1         4         1 
##     Sunni       Oji      Lori      Maya     Nonee      Gigi       Rod 
##         2         3         4         3         3         2         2 
##    Reeses      Oreo      Susi 
##         2         1         4
degree(net, mode='out')
##       Jon      Anna Henriikka       Gia      Rich      Katy      Kate 
##         4         4         7         5         4         2         4 
##     Sunni       Oji      Lori      Maya     Nonee      Gigi       Rod 
##         3         1         1         0         3         2         2 
##    Reeses      Oreo      Susi 
##         3         2         1
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.1
# Plot the degree distribution for our network
deg.dist <- degree_distribution(net, cumulative=F, mode="all")

netw_degree <- as.data.frame(deg.dist)
    
qplot(deg.dist, data=netw_degree, geom="histogram", binwidth=.05, 
      xlab="Degree", ylab="Probability Density")

deg.dist <- degree_distribution(net, cumulative=T, mode="all")
   

plot(x=0:max(degree(net)), y=1-deg.dist, pch=19, cex=1.2, col="orange", 
      xlab="Degree", ylab="Cumulative Frequency")

Question: What is the difference between these plots?

Distribution of nodes

# Histogram of node degree
V(net)$label <- V(net)$name
V(net)$degree <- degree(net)
hist(V(net)$degree,
     col = 'green',
     main = 'Histogram of Node Degree',
     ylab = 'Frequency',
     xlab = 'Degree of Vertices')

Closeness centrality measures how quickly an entity can access more entities in a network. An entity or person with a high closeness centrality generally has quick access to other entities in a network.

Betweenness centrality identifies an entity’s position within a network in terms of its ability to make connections to other pairs or groups in a network. An entity or person with high betweenness centrality has a greater amount of influence over what happens in a network.
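Both ideas are easy to verify on a toy star graph: the center lies on every shortest path between the 4 leaves, so its betweenness is choose(4, 2) = 6 while every leaf scores 0, and its closeness (1 / sum of distances to the others) is the highest in the graph:

```r
library(igraph)

# An undirected star: one central node connected to 4 leaves
g <- make_star(5, mode = "undirected", center = 1)
betweenness(g)  # center: 6; leaves: 0
closeness(g)    # center: 1/4 = 0.25; leaves: 1/7
```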

reciprocity(net) # are people sending out ties also getting them?
## [1] 0.4166667
closeness(net, mode="all", weights = NA) # how close any one node is to anyone else
##        Jon       Anna  Henriikka        Gia       Rich       Katy 
## 0.03333333 0.03030303 0.04166667 0.03846154 0.03225806 0.03125000 
##       Kate      Sunni        Oji       Lori       Maya      Nonee 
## 0.03030303 0.02857143 0.02564103 0.02941176 0.03225806 0.03571429 
##       Gigi        Rod     Reeses       Oreo       Susi 
## 0.02702703 0.02941176 0.03030303 0.02222222 0.02857143
#can change mode to "in" (how close are you to people sending you ties) or "out"

betweenness(net, directed = T, weights = NA)
##         Jon        Anna   Henriikka         Gia        Rich        Katy 
##  24.0000000   5.8333333 127.0000000  93.5000000  16.5000000  20.3333333 
##        Kate       Sunni         Oji        Lori        Maya       Nonee 
##   1.8333333  19.5000000   0.8333333  15.0000000   0.0000000  33.5000000 
##        Gigi         Rod      Reeses        Oreo        Susi 
##  20.0000000   4.0000000   5.6666667   0.0000000  58.5000000

Question: Who has quicker access to other colleagues in the network?

Question: Who has relatively more influence over their colleagues?

Question: Is there a high degree of mutual relationships in this network?

Here we have a loosely connected network, not a core-periphery one, which is why the closeness scores are not highly variable. This may mean that nobody can uniquely diffuse information in this network, for example.
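The reciprocity figure above can also be hand-checked on a toy graph: with ties A -> B, B -> A, and A -> C, the A/B pair is mutual, so 2 of the 3 ties are reciprocated:

```r
library(igraph)

# Three directed ties, of which two form a mutual pair
g <- graph_from_literal(A --+ B, B --+ A, A --+ C)
reciprocity(g)  # 2/3 = 0.6666667
```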

We can plot these measures onto our graph as attributes as below.

#Layout Options
set.seed(3952)  # set seed to make the layout reproducible
layout1 <- layout.fruchterman.reingold(net,niter=500)

#Node or Vertex Options: Size and Color
V(net)$size=betweenness(net)/5 
#Because betweenness has a wide range, we divide by 5 to keep the high-betweenness nodes from overshadowing everything else.
V(net)$color <- ifelse(nodes[V(net), 4] == "Professor", "green", "yellow")

#Edge Options: Color
E(net)$color <- "grey"

#Plotting, Now Specifying an arrow size and getting rid of arrow heads
#We are letting the color and the size of the node indicate the directed nature of the graph
plot(net, edge.arrow.size=.45)

Clearly Henriikka has the highest betweenness centrality here.

Here we keep links that have weight higher than the mean for the network. In igraph, we can delete edges using delete_edges(net, edges):

cut.off <- mean(links$weight) 
net.sp <- delete_edges(net, E(net)[weight<cut.off])
plot(net.sp, vertex.label=V(net)$name, layout=layout_with_kk) 

Faceting networks

Make the edges a different color depending on the type of connection, virtual or in person.

E(net)$width <- 1.5
plot(net, edge.color=c("dark red", "slategrey")[(E(net)$type=="virtual")+1],
     vertex.color="gray40", layout=layout.fruchterman.reingold, edge.curved=.3)

net.m <- net - E(net)[E(net)$type=="virtual"] # another way to delete edges
net.h <- net - E(net)[E(net)$type=="person"]   # using the minus operator

# Plot the two link types separately:
par(mfrow=c(1,2))
plot(net.h, vertex.color="orange", layout=layout_with_fr, main="Virtual Connections")
plot(net.m, vertex.color="lightsteelblue2", layout=layout_with_fr, main="Person Connections")

A ggraph package example (for ggplot2 users)

library(ggraph)
## Warning: package 'ggraph' was built under R version 3.6.1
library(igraph)

ggraph(net) +
  geom_edge_link() +   # add edges to the plot
  geom_node_point()    # add nodes to the plot
## Using `stress` as default layout

ggraph(net, layout="fr") +
  geom_edge_link() +
  ggtitle("Look ma, no nodes!")  # add title to the plot

ggraph(net, layout="fr") +
  geom_edge_fan(color="gray50", width=0.8) + 
  geom_node_point(color=V(net)$color, size=8) +
  theme_void()

ggraph(net, layout = 'linear') + 
  geom_edge_arc(color = "orange", width=0.8) +
  geom_node_point(size=5, color="gray50") +
  theme_void()

ggraph(net, layout="lgl") +
  geom_edge_link(aes(color = type)) +           # colors by edge type 
  geom_node_point(aes(size = years.worked)) +  # size by years worked  
  theme_void()

ggraph(net,  layout = 'lgl') +
  geom_edge_arc(color="gray", curvature=0.3) +            
  geom_node_point(color="orange", aes(size = years.worked)) +     
  geom_node_text(aes(label = name), size=3, color="gray50", repel=T) +
  theme_void()
## Warning: The curvature argument has been deprecated in favour of strength

Plotting two-mode networks

As you might remember, our second example is a two-mode network examining the links between UCCS employees and students.
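Before loading the real data, here is a minimal sketch of the idea behind a two-mode network: it is built from an incidence matrix, where rows are one mode (e.g., employees) and columns are the other (e.g., students). The tiny matrix below is made up for illustration.

```r
library(igraph)

# A toy incidence matrix: rows are one mode (employees),
# columns are the other mode (students); 1 indicates a tie.
inc <- matrix(c(1, 0, 1,
                0, 1, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("s01", "s02"), c("U01", "U02", "U03")))

g2 <- graph_from_incidence_matrix(inc)
V(g2)$type  # FALSE for row (employee) nodes, TRUE for column (student) nodes
```

The `type` attribute is how igraph keeps track of which mode each node belongs to, and we will use it below to style the two modes differently.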

nodes2 <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Dataset2-UCCS-User-Example-NODES.csv", header=T, as.is=T)
links2 <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Dataset2-UCCS-User-Example-EDGES.csv", header=T, row.names=1)

head(nodes2)
##    id      name level.type type.label years.worked
## 1 s01       Jon          1  Professor           10
## 2 s02      Anna          1  Professor            5
## 3 s03 Henriikka          1  Professor            7
## 4 s04       Gia          1  Professor            9
## 5 s05      Rich          1  Professor           20
## 6 s06      Katy          1  Professor           25
head(links2)
##     U01 U02 U03 U04 U05 U06 U07 U08 U09 U10 U11 U12 U13 U14 U15 U16 U17
## s01   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## s02   0   0   0   1   1   0   0   0   0   0   0   0   0   0   0   0   0
## s03   0   0   0   0   0   1   1   1   1   0   0   0   0   0   0   0   0
## s04   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0   0   0
## s05   0   0   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0
## s06   0   0   0   0   0   0   0   0   0   0   0   0   1   1   0   0   1
##     U18 U19 U20
## s01   0   0   0
## s02   0   0   1
## s03   0   0   0
## s04   0   0   0
## s05   0   0   0
## s06   0   0   0
links2 <- as.matrix(links2)
dim(links2)
## [1] 10 20
dim(nodes2)
## [1] 30  5
net2 <- graph_from_incidence_matrix(links2)
table(V(net2)$type)
## 
## FALSE  TRUE 
##    10    20
net2
## IGRAPH aa576ef UN-B 30 31 -- 
## + attr: type (v/l), name (v/c)
## + edges from aa576ef (vertex names):
##  [1] s01--U01 s01--U02 s01--U03 s02--U04 s02--U05 s02--U20 s03--U06
##  [8] s03--U07 s03--U08 s03--U09 s04--U09 s04--U10 s04--U11 s05--U11
## [15] s05--U12 s05--U13 s06--U13 s06--U14 s06--U17 s07--U14 s07--U15
## [22] s07--U16 s08--U16 s08--U17 s08--U18 s08--U19 s09--U06 s09--U19
## [29] s09--U20 s10--U01 s10--U11
plot(net2, vertex.label=V(net2)$name)

As with one-mode networks, we can modify the network object to include the visual properties that will be used by default when plotting the network. Notice that this time we will also change the shape of the nodes - UCCS employees will be squares, and students will be circles.

# Professors are blue squares, student nodes are orange circles:
V(net2)$color <- c("steel blue", "orange")[V(net2)$type+1]
V(net2)$shape <- c("square", "circle")[V(net2)$type+1]

#Professors will have name labels, students will not:
V(net2)$label <- ""
V(net2)$label[V(net2)$type==F] <- nodes2$name[V(net2)$type==F] 
V(net2)$label.cex=.6
V(net2)$label.font=2

plot(net2, vertex.label.color="white", vertex.size=(2-V(net2)$type)*8) 

plot(net2, vertex.shape="none", vertex.label=nodes2$name,
     vertex.label.color=V(net2)$color, vertex.label.font=2, 
     vertex.label.cex=.6, edge.color="gray70",  edge.width=2)

Below I include the use of images as nodes. In order to do this, you will need the png package (if missing, install it with install.packages('png')).

# install.packages('png')
library('png')

img.1 <- readPNG("D:/Barboza Law/COC presentation/Day 3/computer.png")
img.2 <- readPNG("D:/Barboza Law/COC presentation/Day 3/User-Administrator-Green-icon.png")

V(net2)$raster <- list(img.1, img.2)[V(net2)$type+1]

plot(net2, vertex.shape="raster", vertex.label=NA,
     vertex.size=16, vertex.size2=16, edge.width=2)

Other ways to visualize and represent a network

Heatmap

netm <- get.adjacency(net, attr="weight", sparse=F)
colnames(netm) <- V(net)$name
rownames(netm) <- V(net)$name

palf <- colorRampPalette(c("gold", "dark orange")) 
heatmap(netm[,17:1], Rowv = NA, Colv = NA, col = palf(100), 
        scale="none", margins=c(10,10) )

Overlaying networks on geographic maps

library('maps')
library('geosphere')
## Warning: package 'geosphere' was built under R version 3.6.1
par(mfrow = c(2,2), mar=c(0,0,0,0))

map("usa", col="tomato",  border="gray10", fill=TRUE, bg="gray30")
map("state", col="orange",  border="gray10", fill=TRUE, bg="gray30")
map("county", col="palegreen",  border="gray10", fill=TRUE, bg="gray30")
map("world", col="skyblue",  border="gray10", fill=TRUE, bg="gray30")

airports <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Dataset3-Airlines-NODES.csv", header=TRUE) 
flights <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Dataset3-Airlines-EDGES.csv", header=TRUE, as.is=TRUE)

head(flights)
##   Source Target Freq
## 1      0    109   10
## 2      1     36   10
## 3      1     61   10
## 4      2    152   10
## 5      3    104   10
## 6      4    132   10
head(airports)
##   ID                     Label Code             City latitude  longitude
## 1  0       Adams Field Airport  LIT  Little Rock, AR 34.72944  -92.22444
## 2  1     Akron/canton Regional  CAK Akron/Canton, OH 40.91611  -81.44222
## 3  2      Albany International  ALB           Albany 42.73333  -73.80000
## 4  3                 Albemarle  CHO  Charlottesville 38.13333  -78.45000
## 5  4 Albuquerque International  ABQ      Albuquerque 35.04028 -106.60917
## 6  5  Alexandria International  AEX   Alexandria, LA 31.32750  -92.54861
##   ToFly Visits
## 1     0    105
## 2     0    123
## 3     0    129
## 4     1    114
## 5     0    105
## 6     0     93
# Select only large airports: ones with more than 10 connections in the data.
tab <- table(flights$Source)
big.id <- names(tab)[tab>10]
airports <- airports[airports$ID %in% big.id,]
flights  <- flights[flights$Source %in% big.id & 
                      flights$Target %in% big.id, ]

# Plot a map of the United States:
map("state", col="grey20", fill=TRUE, bg="black", lwd=0.1)

# Add a point on the map for each airport:
points(x=airports$longitude, y=airports$latitude, pch=19, 
       cex=airports$Visits/80, col="orange")

col.1 <- adjustcolor("orange red", alpha=0.4)
col.2 <- adjustcolor("orange", alpha=0.4)
edge.pal <- colorRampPalette(c(col.1, col.2), alpha = TRUE)
edge.col <- edge.pal(100)

for(i in 1:nrow(flights))  {
  node1 <- airports[airports$ID == flights[i,]$Source,]
  node2 <- airports[airports$ID == flights[i,]$Target,]
  
  arc <- gcIntermediate( c(node1[1,]$longitude, node1[1,]$latitude), 
                         c(node2[1,]$longitude, node2[1,]$latitude), 
                         n=1000, addStartEnd=TRUE )
  edge.ind <- round(100*flights[i,]$Freq / max(flights$Freq))
  
  lines(arc, col=edge.col[edge.ind], lwd=edge.ind/30)
}

To understand how the network helps inform the research question, there are two broad approaches:

  • Consider the network as a variable (e.g., in a regression): do network features “predict” outcomes? (Does school isolation contribute to delinquency?); and
  • Consider networks as structures with causal properties.

Both leverage connection and position.
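The first approach can be sketched in a few lines: compute a network feature for each node and treat it as a predictor in a regression. The network and outcome below are simulated purely for illustration (the real example with the emon data follows).

```r
library(igraph)

# Hypothetical sketch: a network measure as a regression predictor.
set.seed(1)
g <- sample_gnp(30, p = 0.15)            # toy random network, 30 nodes
deg <- degree(g)                         # network feature for each node
outcome <- 2 + 0.5 * deg + rnorm(30)     # simulated node-level outcome

summary(lm(outcome ~ deg))               # does the network feature "predict" it?
```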

library(sna)
## Warning: package 'sna' was built under R version 3.6.3
## Loading required package: statnet.common
## Warning: package 'statnet.common' was built under R version 3.6.3
## 
## Attaching package: 'statnet.common'
## The following object is masked from 'package:base':
## 
##     order
## Loading required package: network
## Warning: package 'network' was built under R version 3.6.3
## network: Classes for Relational Data
## Version 1.16.0 created on 2019-11-30.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
##                     Mark S. Handcock, University of California -- Los Angeles
##                     David R. Hunter, Penn State University
##                     Martina Morris, University of Washington
##                     Skye Bender-deMoll, University of Washington
##  For citation information, type citation("network").
##  Type help("network-package") to get started.
## 
## Attaching package: 'network'
## The following objects are masked from 'package:igraph':
## 
##     %c%, %s%, add.edges, add.vertices, delete.edges,
##     delete.vertices, get.edge.attribute, get.edges,
##     get.vertex.attribute, is.bipartite, is.directed,
##     list.edge.attributes, list.vertex.attributes,
##     set.edge.attribute, set.vertex.attribute
## sna: Tools for Social Network Analysis
## Version 2.5 created on 2019-12-09.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
##  For citation information, type citation("sna").
##  Type help(package="sna") to get started.
## 
## Attaching package: 'sna'
## The following objects are masked from 'package:igraph':
## 
##     betweenness, bonpow, closeness, components, degree,
##     dyad.census, evcent, hierarchy, is.connected, neighborhood,
##     triad.census
library(network) 


# The emon dataset - interorganizational networks
#=============================================================#

# We'll use the emon dataset: interorganizational Search and Rescue
# networks (Drabek et al.), included in the "network" package.
# The dataset contains 7 networks; each node has 8 attributes.

#?emon
data(emon)
emon  # a list of 7 networks 
## $Cheyenne
##  Network attributes:
##   vertices = 14 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   total edges= 83 
##     missing edges= 0 
##     non-missing edges= 83 
## 
##  Vertex attribute names: 
##     Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff 
## 
##  Edge attribute names: 
##     Frequency 
## 
## $HurrFrederic
##  Network attributes:
##   vertices = 21 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   total edges= 118 
##     missing edges= 0 
##     non-missing edges= 118 
## 
##  Vertex attribute names: 
##     Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff 
## 
##  Edge attribute names: 
##     Frequency 
## 
## $LakePomona
##  Network attributes:
##   vertices = 20 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   total edges= 148 
##     missing edges= 0 
##     non-missing edges= 148 
## 
##  Vertex attribute names: 
##     Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff 
## 
##  Edge attribute names: 
##     Frequency 
## 
## $MtSi
##  Network attributes:
##   vertices = 13 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   total edges= 33 
##     missing edges= 0 
##     non-missing edges= 33 
## 
##  Vertex attribute names: 
##     Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff 
## 
##  Edge attribute names: 
##     Frequency 
## 
## $MtStHelens
##  Network attributes:
##   vertices = 27 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   total edges= 123 
##     missing edges= 0 
##     non-missing edges= 123 
## 
##  Vertex attribute names: 
##     Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff 
## 
##  Edge attribute names: 
##     Frequency 
## 
## $Texas
##  Network attributes:
##   vertices = 25 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   total edges= 186 
##     missing edges= 0 
##     non-missing edges= 186 
## 
##  Vertex attribute names: 
##     Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff 
## 
##  Edge attribute names: 
##     Frequency 
## 
## $Wichita
##  Network attributes:
##   vertices = 20 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   total edges= 149 
##     missing edges= 0 
##     non-missing edges= 149 
## 
##  Vertex attribute names: 
##     Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff 
## 
##  Edge attribute names: 
##     Frequency
# Cheyenne network (the first one in the list)
ch.net <- emon$Cheyenne 
plot(ch.net)

# Extract the node attributes of the Cheyenne network into a data frame
ch.attr <- data.frame(matrix(0, 14, 8))
colnames(ch.attr) <-   c("vertex.names", "Command.Rank.Score", "Decision.Rank.Score", "Formalization", "Location", "Paid.Staff", "Sponsorship", "Volunteer.Staff")

# Copy each of the 8 vertex attributes to a variable in the ch.attr data frame
for (i in 1:8) { ch.attr[,i] <- (ch.net %v% colnames(ch.attr)[i]) }

ch.attr
##                                      vertex.names Command.Rank.Score
## 1       Wyoming.Disaster.and.Civil.Defense.Agnecy                  0
## 2     Wyoming.State.National.Guard..Army.and.Air.                 10
## 3                    Wyoming.State.Highway.Patrol                  3
## 4    Francis.E..Warren.Air.Force.Base..fire.unit.                  5
## 5                                       Red.Cross                  0
## 6                                  Salvation.Army                  0
## 7                 Laramie.County.Sheriff.s.Office                 20
## 8  Laramie.County...Cheyenne.Civil.Defense.Agency                 40
## 9                        Cheyenne.Fire.Department                 10
## 10                     Cheyenne.Police.Department                 30
## 11    Laramie.County.Fire.District..1..volunteer.                  2
## 12    Laramie.County.Fire.District..2..volunteer.                  2
## 13                          A.1.Ambulance.Service                  2
## 14                          Shy.Wy.HAM.Radio.Club                  0
##    Decision.Rank.Score Formalization Location Paid.Staff Sponsorship
## 1                   20             2        L         10       State
## 2                    7             1        L        400       State
## 3                    0             1        L        200       State
## 4                    5             1        L         60     Federal
## 5                    0             1        L          1     Private
## 6                    0             1        L          7     Private
## 7                   20             1        L         60      County
## 8                   50             2        L          7 County/City
## 9                   10             1        L         70        City
## 10                  20             3        L        100        City
## 11                   3            NA        L         NA      County
## 12                   3             1        L          1      County
## 13                   2             1        L         10     Private
## 14                   0             1        L          0     Private
##    Volunteer.Staff
## 1               50
## 2             2000
## 3                0
## 4                0
## 5               20
## 6               80
## 7               20
## 8              100
## 9                0
## 10               0
## 11              NA
## 12              20
## 13               6
## 14              NA
# Correlation & Linear Regression in R
#=============================================================#

# Check the correlation between command rank and decision rank scores:
# (remember that ch.attr[[2]] is the same as ch.attr$Command.Rank.Score, etc.)

# Calculate the correlation btw command and decision rank
cor(ch.attr$Command.Rank.Score, ch.attr$Decision.Rank.Score)      
## [1] 0.8718257
cor.test(ch.attr$Command.Rank.Score, ch.attr$Decision.Rank.Score) # Examine the significance
## 
##  Pearson's product-moment correlation
## 
## data:  ch.attr$Command.Rank.Score and ch.attr$Decision.Rank.Score
## t = 6.1658, df = 12, p-value = 4.83e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6349626 0.9588618
## sample estimates:
##       cor 
## 0.8718257
# Linear regression: lm(DV ~ IV1 + IV2 + IV3 + ...) 
# DV = dependent variable, IV = independent variables

# Calculate centrality measures and store them in the ch.attr data frame:

ch.attr$IndegCent  <- degree(ch.net, gmode="digraph", cmode="indegree")  # indegree centralities
ch.attr$OutdegCent <- degree(ch.net, gmode="digraph", cmode="outdegree") # outdegree centralities 
ch.attr$BetweenCent <- betweenness(ch.net, gmode="digraph") # betweenness centralities

# We can use the centralities in a linear regression model:

ch.lm.1 <- lm(ch.attr$Command.Rank.Score ~ ch.attr$Paid.Staff + ch.attr$BetweenCent) # Model with betweenness centrality
summary(ch.lm.1)
## 
## Call:
## lm(formula = ch.attr$Command.Rank.Score ~ ch.attr$Paid.Staff + 
##     ch.attr$BetweenCent)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.271  -5.038  -2.433   1.739  26.225 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          2.34254    4.68527   0.500   0.6279  
## ch.attr$Paid.Staff   0.01285    0.02886   0.445   0.6657  
## ch.attr$BetweenCent  0.88864    0.38665   2.298   0.0444 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.39 on 10 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3498, Adjusted R-squared:  0.2198 
## F-statistic:  2.69 on 2 and 10 DF,  p-value: 0.1162

Fake Data

Below I used the same procedure to cluster the fake corrections data. There were only 2 numeric variables in the data set, so I could only use those. This illustrates that when something doesn’t visualize well, there is usually an alternative approach that works relatively better.

We are clustering mhneeds_lvl and subs_lvl_values across incident codes (ldesc). Instead of visualizing the clusters, I created a dendrogram using fviz_dend.

library(tidyverse)  # data manipulation
## Registered S3 method overwritten by 'rvest':
##   method            from
##   read_xml.response xml2
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   1.0.0     v dplyr   0.8.3
## v readr   1.3.1     v stringr 1.4.0
## v tibble  2.1.3     v forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.6.1
## Warning: package 'tidyr' was built under R version 3.6.1
## Warning: package 'dplyr' was built under R version 3.6.1
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::as_data_frame() masks tibble::as_data_frame(), igraph::as_data_frame()
## x purrr::compose()       masks igraph::compose()
## x tidyr::crossing()      masks igraph::crossing()
## x dplyr::filter()        masks stats::filter()
## x dplyr::groups()        masks igraph::groups()
## x dplyr::lag()           masks stats::lag()
## x purrr::map()           masks maps::map()
## x purrr::simplify()      masks igraph::simplify()
library(cluster)    # clustering algorithms
## 
## Attaching package: 'cluster'
## The following object is masked from 'package:maps':
## 
##     votes.repub
library(factoextra) # clustering algorithms & visualization (fvis cluster and fvis distance)
## Warning: package 'factoextra' was built under R version 3.6.1
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fake_data <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/fake_data.csv")
#Select only the columns of interest. Here I selected the variable ldesc and the two numeric variables:
#mhneeds_lvl and subs_lvl_values.
df <- fake_data[, c(9, 4:5)]

#This command aggregates the data by ldesc and calculates the overall means for the mhneeds and substance use needs for each ldesc 
df <- aggregate(fake_data[, 4:5], list(ldesc=fake_data$ldesc), mean)

#We cannot cluster on the variable ldesc so below we set that column to be labels
df2 <- df[,-1]
rownames(df2) <- df[,1]

#Here we scale the data to normalize it, by subtracting the mean from each variable and then dividing by the standard deviation
df2 <- scale(df2)

#The command head lets us take a look at the first few rows of the data; the default is 6 rows
head(df2)
##                 mhneeds_lvl subs_lvl_values
## ACT OF SABOTAGE  -1.3839785       1.0960671
## ESCAPE W/ FORCE   0.9999968       1.0379304
## PROPERTY DAMAGE   0.7597913      -0.8025534
## SEXUAL ACT       -0.6551899      -0.3567935
## USE OF FORCE      0.2793802      -0.9746506
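To see that scale() really is just the z-score formula described above, here is a quick check on a toy vector (not part of the corrections data):

```r
# scale() subtracts the column mean and divides by the column
# standard deviation; verify on a small made-up vector.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # should be TRUE
```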
#This calculates the distance between observations
distance <- get_dist(df2)

#This command visualizes the distance. Note: you can change these colors to anything you want.
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

final <- kmeans(df2, 3, nstart = 25)
fviz_cluster(final,  data = df2)

# Visualize a dendrogram using the cluster agglomeration method "complete", which computes all pairwise dissimilarities
hc.cut <- hcut(df2, k = 3, hc_method = "complete")
fviz_dend(hc.cut, show_labels = TRUE, rect = TRUE)

Property damage and use of force are most similar, then sexual act, etc.

Data Transformation

As we discussed last week, the data are rarely, if ever, in the format you need. At a minimum, new variables must be created, variables may need recoding, variable summaries will need to be computed, and often it’s important to rename variables. One of the most powerful packages for reshaping data in R is dplyr. We will cover dplyr basics below. Let’s use a different data set on crimes from the city of Los Angeles.
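Two of the tasks just mentioned, creating new variables and renaming variables, use the dplyr verbs mutate() and rename(). A quick taste on the built-in mtcars data (rather than our crime data):

```r
library(dplyr)

# mutate() creates a new variable; rename() renames an existing one.
mtcars_new <- mtcars %>%
  mutate(kml = mpg * 0.425) %>%   # new variable: rough km-per-liter conversion
  rename(cylinders = cyl)         # renamed variable

head(mtcars_new)
```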

Let’s take a look at the data we will be using. The way these data are made available is very common, as all big, open data websites allow multiple methods of downloading the data. Additionally, if the data are in spatial format, there are several options for downloading the spatial data as well. Note, sometimes the data are spatial but are made available in a non-spatial way, such as a csv file. No need to worry because those data can be transformed and then mapped accordingly.

From the city of LA open data website, click “Data Catalog.” These are the data we saw last week. Notice the column called “Crm Cd Desc,” which describes the type of crime. Let’s look at data on domestic violence. There are two ways to filter these data.

  1. In the box titled “Find in this Dataset” type “intimate partner - aggravated” then click “Export” and “CSV for excel.” This results in a downloaded file containing only these types of crimes. There were 1,363 cases of aggravated domestic assault committed in the city of LA since Jan 1, 2020 (as of July 5, 2020).

Download the data to your computer and open the file. First, open it in Excel.

I already uploaded this data set to my GitHub; let’s load it into RStudio.

Let’s call this data set IPV_data.

IPV_data <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Crime_Data_from_2020_to_Present_import.csv")

#glimpse shows the structure of the data: every column, its type, and the first few values
glimpse(IPV_data)
## Observations: 1,363
## Variables: 40
## $ ï..DR_NO          <int> 200206684, 200104417, 200104511, 200104664, ...
## $ Date.Rptd         <fct> 2/23/20 0:00, 1/6/20 0:00, 1/8/20 0:00, 1/9/...
## $ DATE.OCC          <fct> 2/23/20 0:00, 1/6/20 0:00, 1/8/20 0:00, 1/9/...
## $ TIME.OCC          <int> 315, 2100, 400, 1900, 2130, 2300, 800, 725, ...
## $ newdate           <fct> 02/23/2020, 01/06/2020, 01/08/2020, 01/09/20...
## $ formatted_newdate <fct> 02/23/2020, 01/06/2020, 01/08/2020, 01/09/20...
## $ serialdate        <int> 43884, 43836, 43838, 43839, 43839, 43843, 43...
## $ day               <int> 23, 6, 8, 9, 9, 13, 1, 18, 18, 29, 27, 28, 2...
## $ month             <int> 2, 1, 1, 1, 1, 1, 3, 1, 1, 6, 1, 1, 1, 2, 2,...
## $ year              <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20...
## $ weekday           <fct> Wed, Thu, Thu, Mon, Sun, Sat, Sat, Mon, Mon,...
## $ time_text         <dbl> 0.13541667, 0.87500000, 0.16666667, 0.791666...
## $ time_serial       <fct> 03:15, 21:00, 04:00, 19:00, 21:30, 23:00, 08...
## $ time              <fct> 3:15 AM, 9:00 PM, 4:00 AM, 7:00 PM, 9:30 PM,...
## $ time.1            <fct> 3:15 AM, 9:00 PM, 4:00 AM, 7:00 PM, 9:30 PM,...
## $ date_time         <fct> 02/23/2020 0.135416666666667, 01/06/2020 0.8...
## $ AREA              <int> 2, 1, 1, 1, 1, 1, 21, 1, 1, 5, 1, 1, 1, 1, 1...
## $ AREA.NAME         <fct> Rampart, Central, Central, Central, Central,...
## $ Rpt.Dist.No       <int> 256, 182, 154, 156, 138, 156, 2126, 111, 118...
## $ Part.1.2          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Crm.Cd            <int> 236, 236, 236, 236, 236, 236, 236, 236, 236,...
## $ Crm.Cd.Desc       <fct> INTIMATE PARTNER - AGGRAVATED ASSAULT, INTIM...
## $ Mocodes           <fct> 2000 1813 2002 0334 0445 1601 0913, 0913 031...
## $ Vict.Age          <int> 37, 26, 38, 45, 38, 46, 41, 22, 21, 42, 29, ...
## $ Vict.Sex          <fct> M, F, M, F, M, F, F, F, F, F, F, M, F, M, M,...
## $ Vict.Descent      <fct> H, H, B, B, B, W, O, W, H, W, B, H, H, O, B,...
## $ Premis.Cd         <int> 502, 502, 502, 502, 502, 502, 501, 502, 502,...
## $ Premis.Desc       <fct> "MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC...
## $ Weapon.Used.Cd    <int> 200, 400, 302, 400, 216, 400, 400, 400, 400,...
## $ Weapon.Desc       <fct> "KNIFE WITH BLADE 6INCHES OR LESS", "STRONG-...
## $ Status            <fct> AO, IC, AA, AA, AA, IC, AO, AA, AA, IC, AO, ...
## $ Status.Desc       <fct> Adult Other, Invest Cont, Adult Arrest, Adul...
## $ Crm.Cd.1          <int> 236, 236, 236, 236, 236, 236, 236, 236, 236,...
## $ Crm.Cd.2          <int> NA, NA, 998, NA, 998, NA, NA, NA, 998, NA, 9...
## $ Crm.Cd.3          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Crm.Cd.4          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LOCATION          <fct> 600 S  UNION                        AV, 1200...
## $ Cross.Street      <fct> , , , , , WALL                         ST, ,...
## $ LAT               <dbl> 34.0565, 34.0423, 34.0474, 34.0449, 34.0503,...
## $ LON               <dbl> -118.2724, -118.2666, -118.2496, -118.2458, ...

There are several data types that we want to note. Also, there is a variable called “Premis.Desc”. Let’s make a table of its values so we can see the types of premises that are associated with IPV. The default table in R is ugly, so I am going to use a function called kable from the knitr package (along with the commands of dplyr), which formats the table nicely for me. To use kable we need to load the knitr library.

Filtering data

library(knitr)
IPV_data %>%
  group_by(Premis.Desc)%>%
  summarize(n=n())%>%
  arrange(-n) %>%
  kable()
Premis.Desc n
SINGLE FAMILY DWELLING 457
MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC) 429
STREET 199
SIDEWALK 61
PARKING LOT 42
HOTEL 23
VEHICLE, PASSENGER/TRUCK 22
MOTEL 17
PARK/PLAYGROUND 12
ALLEY 11
DRIVEWAY 10
OTHER RESIDENCE 10
OTHER BUSINESS 7
MOBILE HOME/TRAILERS/CONSTRUCTION TRAILERS/RV’S/MOTORHOME 6
GAS STATION 5
SINGLE RESIDENCE OCCUPANCY (SRO’S) LOCATIONS 5
TRANSIENT ENCAMPMENT 5
YARD (RESIDENTIAL/BUSINESS) 5
ABANDONED BUILDING ABANDONED HOUSE 3
GARAGE/CARPORT 3
PARKING UNDERGROUND/BUILDING 3
FREEWAY 2
NURSING/CONVALESCENT/RETIREMENT HOME 2
OTHER PREMISE 2
RESTAURANT/FAST FOOD 2
TRAIN TRACKS 2
TRANSITIONAL HOUSING/HALFWAY HOUSE 2
BEACH 1
BEAUTY/BARBER SHOP 1
CAR WASH 1
ELEMENTARY SCHOOL 1
FIRE STATION 1
GROUP HOME 1
HOSPITAL 1
LAUNDROMAT 1
MINI-MART 1
MISSIONS/SHELTERS 1
PROJECT/TENEMENT/PUBLIC HOUSING 1
SHORT-TERM VACATION RENTAL 1
STUDIO (FILM/PHOTOGRAPHIC/MUSIC) 1
TATTOO PARLOR* 1
TRANSPORTATION FACILITY (AIRPORT) 1
UNDERPASS/BRIDGE* 1

Let’s select “SINGLE FAMILY DWELLING” and “MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)” so we are sure this is domestic. We use “filter” to filter the data by rows in this way.

IPV_subset <- filter(IPV_data, Premis.Desc =="SINGLE FAMILY DWELLING" | Premis.Desc == "MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)")

Looks Good!

Let’s create a heat map by day and month. We start by making a table of weekday and month. Then we store the table results in a data frame. We do this so we can manipulate the variables a bit; for example, I want the days to be in order Mon - Sun. When we store data this way, the variable names default to Var1, Var2, and Freq. I changed the first two back to “Weekday” and “Month” and used ggplot with the geom geom_tile to visualize the results.

table(IPV_subset$weekday, IPV_subset$month)
##      
##        1  2  3  4  5  6
##   Fri 18 19 19 20 19 21
##   Mon 28 15 28 28 23 17
##   Sat 23 14 21 23 30 17
##   Sun 25 28 24 25 23 25
##   Thu 23 14 17 30 18 16
##   Tue 19 15 31 15 17 19
##   Wed 22 21 26 15 19 16
DayHourCount <- as.data.frame(table(IPV_subset$weekday, IPV_subset$month))
DayHourCount$Var1 <- factor(DayHourCount$Var1, ordered = T, levels = c("Mon", "Tue", "Wed", "Thu","Fri", "Sat","Sun"))
colnames(DayHourCount)[1] <- "Weekday"
colnames(DayHourCount)[2] <- "Month"

ggplot(DayHourCount, aes(x = Weekday, y = Month)) + 
  geom_tile(aes(fill = Freq))

Now, let’s select a few columns to make the data more manageable, and then look at some summary statistics.

IPV_subset_columns <- select(IPV_subset, formatted_newdate, TIME.OCC, day, month, time, Vict.Age, Vict.Sex, Vict.Descent, Mocodes, LAT, LON, Status.Desc)

Nice. Let’s create grouped summaries now. We do that with the group_by function in dplyr. When you use dplyr commands on a grouped dataset, they are automatically applied within each group. This example also introduces the “pipe” operator (%>%), which lets you chain multiple operations together in sequence.

library(dplyr)
IPV_subset_columns %>%
  group_by(Vict.Sex) %>%
  summarize(mean_age = mean(Vict.Age, na.rm = TRUE)) %>%
  kable()
Vict.Sex mean_age
F 32.63636
M 39.91593

Here is another example.

IPV_subset_columns  %>%  
  group_by(Vict.Sex) %>% 
  filter(Vict.Descent =="W") %>% 
  summarize(
    count = n(), mean_age = mean(Vict.Age, na.rm=TRUE)
    )
## # A tibble: 2 x 3
##   Vict.Sex count mean_age
##   <fct>    <int>    <dbl>
## 1 F          113     34.9
## 2 M           37     40.8

And another… what does this code do?

IPV_subset_columns  %>%  
  group_by(Vict.Sex, Vict.Descent) %>% 
  summarize(
    count = n(), mean_age = mean(Vict.Age, na.rm=TRUE)
    )
## # A tibble: 13 x 4
## # Groups:   Vict.Sex [2]
##    Vict.Sex Vict.Descent count mean_age
##    <fct>    <fct>        <int>    <dbl>
##  1 F        A               15     28.7
##  2 F        B              211     32.8
##  3 F        F                1     48  
##  4 F        H              289     31.5
##  5 F        K                2     35  
##  6 F        O               29     35.3
##  7 F        W              113     34.9
##  8 M        A                4     40  
##  9 M        B               86     42.8
## 10 M        H               91     37.4
## 11 M        K                1     30  
## 12 M        O                7     34.3
## 13 M        W               37     40.8

Manipulating data in this way is useful for the visualization techniques below, such as the Sankey diagram. But first, let’s do another exercise.

Exercise 6 Fake data

Calculate the mean level of mental health needs for Black and White individuals by incident description (ldesc). Here is the data:

str(fake_data)
## 'data.frame':    299 obs. of  11 variables:
##  $ ï..docno       : int  391501 803678 606522 482018 966031 841411 568293 146773 627755 939000 ...
##  $ ethnic_cd      : Factor w/ 8 levels "A","B","H","I",..: 1 6 4 6 1 2 8 1 2 2 ...
##  $ highschool_ged : Factor w/ 4 levels "G","H","M","N": 2 4 1 2 4 4 3 3 1 4 ...
##  $ mhneeds_lvl    : int  3 4 3 3 1 1 4 5 4 4 ...
##  $ subs_lvl_values: int  1 5 3 4 4 2 3 2 3 2 ...
##  $ off_cat        : int  31 31 31 31 31 31 31 31 31 31 ...
##  $ off_mdesc      : Factor w/ 5 levels "ABUSE OF PUBLIC OFFICE-MISDEMEANOR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ incident_cod   : Factor w/ 5 levels "AC","ES","PD",..: 1 1 4 3 3 1 1 4 3 1 ...
##  $ ldesc          : Factor w/ 5 levels "ACT OF SABOTAGE",..: 1 1 4 3 3 1 1 4 3 1 ...
##  $ fac_cd         : Factor w/ 3 levels "DC","SF","YP": 1 3 1 3 3 3 2 3 3 2 ...
##  $ lu_cd          : Factor w/ 22 levels "1","1A","2","21",..: 13 18 13 22 20 21 5 22 19 1 ...
fake_data  %>%  
  group_by(XXXX, XXXX) %>% 
  filter(XXXX =="W" | XXXX == "B") %>% 
  dplyr::summarize(
    mental_health_level = mean(mhneeds_lvl, na.rm=TRUE)
    )
fake_data  %>%  
  group_by(ldesc, ethnic_cd) %>% 
  filter(ethnic_cd =="W" | ethnic_cd == "B") %>% 
  dplyr::summarize(
    mental_health_level = mean(mhneeds_lvl, na.rm=TRUE)
    )
## # A tibble: 10 x 3
## # Groups:   ldesc [5]
##    ldesc           ethnic_cd mental_health_level
##    <fct>           <fct>                   <dbl>
##  1 ACT OF SABOTAGE B                        2.75
##  2 ACT OF SABOTAGE W                        3.29
##  3 ESCAPE W/ FORCE B                        2.67
##  4 ESCAPE W/ FORCE W                        3.33
##  5 PROPERTY DAMAGE B                        3.17
##  6 PROPERTY DAMAGE W                        3.29
##  7 SEXUAL ACT      B                        3.11
##  8 SEXUAL ACT      W                        1.8 
##  9 USE OF FORCE    B                        1.83
## 10 USE OF FORCE    W                        3.4

Sankey Diagrams

A few things to note:

The Sankey diagram orders categories differently from the table: it automatically arranges them to minimize the amount of overlap between flows. Where two rows in the table contain the same combination of values, the diagram merges them, automatically combining identical nodes (blocks) into one.
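The merging behavior described above can be seen with a toy example (hypothetical data, not from the IPV file): rows that share the same category values collapse into a single flow once aggregated, which is why identical nodes appear combined in the diagram.

```r
library(dplyr)

# Hypothetical data: two rows share the same ("A", "F") combination
toy <- data.frame(
  area = c("A", "A", "B"),
  sex  = c("F", "F", "M"),
  n    = c(2, 3, 4)
)

# Aggregating merges the duplicate rows into one, just as the Sankey
# diagram merges blocks that have the same values
toy %>%
  group_by(area, sex) %>%
  summarize(n = sum(n), .groups = "drop")
```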

Click here for a discussion about how to interpret these diagrams.

Sankey_data <-IPV_subset %>% 
  select(AREA.NAME, Vict.Sex, Vict.Descent, Status.Desc) %>% 
  group_by(AREA.NAME, Vict.Sex, Vict.Descent, Status.Desc) %>% 
  dplyr::summarize(
    count = n())

Let’s look at the resulting data

Sankey_data
## # A tibble: 291 x 5
## # Groups:   AREA.NAME, Vict.Sex, Vict.Descent [148]
##    AREA.NAME   Vict.Sex Vict.Descent Status.Desc  count
##    <fct>       <fct>    <fct>        <fct>        <int>
##  1 77th Street F        B            Adult Arrest    10
##  2 77th Street F        B            Adult Other     13
##  3 77th Street F        B            Invest Cont     36
##  4 77th Street F        H            Adult Arrest     7
##  5 77th Street F        H            Adult Other      4
##  6 77th Street F        H            Invest Cont     20
##  7 77th Street F        W            Adult Other      2
##  8 77th Street F        W            Invest Cont      3
##  9 77th Street M        B            Adult Arrest     5
## 10 77th Street M        B            Adult Other      5
## # ... with 281 more rows
library(ggalluvial)
ggplot(data = Sankey_data,
       aes(axis1 = AREA.NAME, axis2 = Vict.Sex, axis3 = Vict.Descent,
           y = count)) +
  scale_x_discrete(limits = c("Area", "Sex", "Race/Ethnicity"), expand = c(.1, .05)) +
  xlab("Demographic") +
  geom_alluvium(aes(fill = Status.Desc)) +
  geom_stratum() + geom_text(stat = "stratum", infer.label = TRUE) +
  theme_minimal() +
  ggtitle("Investigation Status of IPV Incidents",
          "Stratified by Victim demographics and Community") +
  theme(legend.position = 'bottom')

You will notice that the diagram needs to be cleaned up a bit. First, there are too many neighborhood areas represented; in addition, the race/ethnicity variable has too few cases in some categories. Let’s select six neighborhood areas and redraw the chart.

Sankey_data <-IPV_subset %>% 
  select(AREA.NAME, Vict.Sex, Vict.Descent, Status.Desc) %>% 
  group_by(AREA.NAME, Vict.Sex, Vict.Descent, Status.Desc) %>% 
  filter(AREA.NAME == "77th Street" | AREA.NAME == "N Hollywood" | AREA.NAME == "Southeast"| AREA.NAME == "Southwest"| AREA.NAME == "Rampart"| AREA.NAME == "Olympic") %>%
  dplyr::summarize(
    count = n())

ggplot(data = Sankey_data,
       aes(axis1 = AREA.NAME, axis2 = Vict.Sex, axis3 = Vict.Descent,
           y = count)) +
  scale_x_discrete(limits = c("Area", "Sex", "Race/Ethnicity"), expand = c(.1, .05)) +
  xlab("Demographic") +
  geom_alluvium(aes(fill = Status.Desc)) +
  geom_stratum() + geom_text(stat = "stratum", infer.label = TRUE) +
  theme_minimal() +
  ggtitle("Investigation Status of IPV Incidents",
          "Stratified by Victim demographics and Community") +
  theme(legend.position = 'bottom')

Exercise 7

In this exercise you will create a Sankey diagram using the crime category description instead of crime area, along with the demographic characteristics gender and race/ethnicity. First, create a data set called Sankey_data as we did above and select the relevant variables. For this exercise, select only Black, Hispanic, and White victims. Use the dplyr commands to format the data in the appropriate way. Follow the steps below:

Sankey_data <-IPV_subset %>% 
  select(XXXX, Vict.Sex, Vict.Descent, Status.Desc) %>% 
  group_by(XXXX, Vict.Sex, Vict.Descent, Status.Desc) %>% 
  filter(XXXX) %>%
  dplyr::summarize(
    count = n())
ggplot(data = Sankey_data,
       aes(axis1 = XXXX, axis2 = XXXX, axis3 = XXXX,
           y = count)) +
  scale_x_discrete(limits = c(XXXX), expand = c(.1, .05)) +
  xlab("Demographic") +
  geom_alluvium(aes(fill = XXXX)) +
  geom_stratum() + geom_text(stat = "stratum", infer.label = TRUE) +
  theme_minimal() +
  ggtitle("Investigation Status of IPV Incidents",
          "Stratified by Victim demographics and Place of Injury") +
  theme(legend.position = 'bottom')
library(stringr)  # str_replace() comes from stringr
Sankey_data <-IPV_subset %>% 
  dplyr::select(Premis.Desc, Vict.Sex, Vict.Descent, Status.Desc) %>% 
  dplyr::mutate(Premis.Desc = str_replace(Premis.Desc, ".*MULTI-UNIT.*", "Apartment")) %>%
  dplyr::mutate(Premis.Desc = str_replace(Premis.Desc, ".*SINGLE.*", "Family Home")) %>%
  dplyr::group_by(Premis.Desc, Vict.Sex, Vict.Descent, Status.Desc) %>% 
  dplyr::filter(Vict.Descent == "B" | Vict.Descent == "H" | Vict.Descent == "W") %>%
  dplyr::summarize(
    count = n())
ggplot(data = Sankey_data,
       aes(axis1 = Premis.Desc, axis2 = Vict.Sex, axis3 = Vict.Descent,
           y = count)) +
  scale_x_discrete(limits = c("Area", "Sex", "Race/Ethnicity"), expand = c(.1, .05)) +
  xlab("Demographic") +
  geom_alluvium(aes(fill = Premis.Desc)) +
  geom_stratum() + 
  geom_text(stat = "stratum", infer.label = TRUE) +
  theme_minimal() +
  ggtitle("Investigation Status of IPV Incidents",
          "Stratified by Victim demographics and Place of Injury") +
  theme(legend.position = 'bottom')

Fake Data

How do we get from X to the committed offense?

Sankey_data <-fake_data %>% 
  select(highschool_ged, ethnic_cd, mhneeds_lvl, off_mdesc) %>% 
  mutate(off_mdesc = str_replace(off_mdesc, ".*ABUSE .*", "Misdemeanor")) %>%
  mutate(off_mdesc = str_replace(off_mdesc, ".*HOMICIDE-.*", "Homicide")) %>%
  mutate(off_mdesc = str_replace(off_mdesc, ".*FINANCIAL .*", "Financial")) %>%
  mutate(off_mdesc = str_replace(off_mdesc, ".*CRIME .*", "Crime Act")) %>%
  mutate(off_mdesc = str_replace(off_mdesc, ".*CRIMINAL .*", "Criminal Attempt")) %>%
  group_by(highschool_ged, ethnic_cd, mhneeds_lvl, off_mdesc) %>% 
  filter(ethnic_cd == "B" | ethnic_cd == "H" | ethnic_cd == "W") %>%
  dplyr::summarize(
    count = n())
ggplot(data = Sankey_data,
       aes(axis1 = highschool_ged, axis2 = ethnic_cd, axis3 = mhneeds_lvl,
           y = count)) +
  scale_x_discrete(limits = c("GED Status", "Race/Ethnicity", "Mental Health"), expand = c(.1, .05)) +
  xlab("Demographic") +
  geom_alluvium(aes(fill = off_mdesc), main = "Offense Description") +
  geom_stratum() + 
  geom_text(stat = "stratum", infer.label = TRUE) +
  theme_minimal() +
  ggtitle("Investigation Status of IPV Incidents",
          "Stratified by Victim demographics and Place of Injury") +
  theme(legend.position = 'bottom')

Connecting to Big Open Data

Sometimes we would rather connect to the data directly from the provider’s website. Providers make their data available through an API (application programming interface). Some websites require you to register for a token before accessing the API. This is a powerful way to connect to data, as the following examples demonstrate.

The R library that allows you to connect to, and query, Socrata open data portals is called RSocrata. Using it is like reading a csv file directly from a website, except that RSocrata lets you query the (potentially very big) data before you open it. You do this by passing the query string directly into the URL, as follows.

Let’s open the LA city crime data by accessing it directly from the City of LA’s open data portal. Note, this will take a few seconds because it’s a very large file.

library(RSocrata)
base_url = "https://data.lacity.org/resource/2nrs-mtv8.json?" #this is 2020 to present

my_token <- "w0BkWUPZYzjQRwNEVX8KEijw4"
lacity_data <- read.socrata(base_url, my_token)
glimpse(lacity_data)
## Observations: 114,582
## Variables: 28
## $ dr_no          <chr> "010304468", "190101086", "191501505", "1919212...
## $ date_rptd      <dttm> 2020-01-08, 2020-01-02, 2020-01-01, 2020-01-01...
## $ date_occ       <dttm> 2020-01-08, 2020-01-01, 2020-01-01, 2020-01-01...
## $ time_occ       <chr> "2230", "0330", "1730", "0415", "0030", "1315",...
## $ area           <chr> "03", "01", "15", "19", "01", "01", "01", "01",...
## $ area_name      <chr> "Southwest", "Central", "N Hollywood", "Mission...
## $ rpt_dist_no    <chr> "0377", "0163", "1543", "1998", "0163", "0161",...
## $ part_1_2       <chr> "2", "2", "2", "2", "1", "1", "2", "1", "1", "2...
## $ crm_cd         <chr> "624", "624", "745", "740", "121", "442", "946"...
## $ crm_cd_desc    <chr> "BATTERY - SIMPLE ASSAULT", "BATTERY - SIMPLE A...
## $ mocodes        <chr> "0444 0913", "0416 1822 1414", "0329 1402", "03...
## $ vict_age       <chr> "36", "25", "76", "31", "25", "23", "0", "23", ...
## $ vict_sex       <chr> "F", "M", "F", "X", "F", "M", "X", "M", "M", "M...
## $ vict_descent   <chr> "B", "H", "W", "X", "H", "H", "X", "B", "A", "O...
## $ premis_cd      <chr> "501", "102", "502", "409", "735", "404", "726"...
## $ premis_desc    <chr> "SINGLE FAMILY DWELLING", "SIDEWALK", "MULTI-UN...
## $ weapon_used_cd <chr> "400", "500", NA, NA, "500", NA, NA, NA, "306",...
## $ weapon_desc    <chr> "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)...
## $ status         <chr> "AO", "IC", "IC", "IC", "IC", "IC", "IC", "IC",...
## $ status_desc    <chr> "Adult Other", "Invest Cont", "Invest Cont", "I...
## $ crm_cd_1       <chr> "624", "624", "745", "740", "121", "442", "946"...
## $ location       <chr> "1100 W  39TH                         PL", "700...
## $ lat            <chr> "34.0141", "34.0459", "34.1685", "34.2198", "34...
## $ lon            <chr> "-118.2978", "-118.2545", "-118.4019", "-118.44...
## $ crm_cd_2       <chr> NA, NA, "998", NA, "998", "998", "998", "998", ...
## $ cross_street   <chr> NA, NA, NA, NA, NA, NA, NA, NA, "OLIVE", NA, NA...
## $ crm_cd_3       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ crm_cd_4       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

Notice that all of the recorded crimes were downloaded, over 114,000. This is only data from Jan 1, 2020 to the present. Notice also that the variable names are lower case, which matters because R is case sensitive.

I can query the data by embedding a SoQL (Socrata Query Language) statement in the URL. Depending on the type of data you have, some supported query operators include: is, between, is not, starts with, and contains.
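To make these operators concrete, here are a few illustrative $where clauses built as R strings. The field names come from the LA crime dataset above; starts_with() is a SoQL function, and exact operator support varies by portal, so treat these as sketches and check the portal’s API documentation.

```r
# Base URL for the 2020-to-present LA crime endpoint, ready for a query
base <- "https://data.lacity.org/resource/2nrs-mtv8.json?$where="

# Exact match
paste0(base, "vict_sex = 'F'")
# Date range
paste0(base, "date_occ between '2020-06-01' and '2020-08-10'")
# Exclude missing values
paste0(base, "vict_sex IS NOT NULL")
# String prefix match
paste0(base, "starts_with(area_name, 'Holly')")
```

Any of these strings can be passed to read.socrata() the same way base_url is used below.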

Let’s create the string and query the data between June 1, 2020 and the present. I used a trick for the present date, which is to grab the computer’s system date (Sys.Date()) which should be today’s date.

Sys.Date() # Today's date
## [1] "2020-08-10"
(query_string = paste("'", Sys.Date(),"'", sep="")) # The query string has to be a particular format, surrounded by single quotes
## [1] "'2020-08-10'"
(base_url = paste("https://data.lacity.org/resource/2nrs-mtv8.json?$where=date_occ between '2020-06-01' and", query_string)) #this is 2020 to present
## [1] "https://data.lacity.org/resource/2nrs-mtv8.json?$where=date_occ between '2020-06-01' and '2020-08-10'"

Put the URL above into the browser to see the data that will be returned. If nothing comes back, there is an error in the query.

Note: most Socrata-based open data portals use this same URL format.

Now we can connect to the data as before.

lacity_data <- read.socrata(base_url, my_token)
glimpse(lacity_data)
## Observations: 32,626
## Variables: 28
## $ dr_no          <chr> "200610840", "201213939", "200709884", "2014116...
## $ date_rptd      <dttm> 2020-06-01, 2020-06-01, 2020-06-01, 2020-06-01...
## $ date_occ       <dttm> 2020-06-01, 2020-06-01, 2020-06-01, 2020-06-01...
## $ time_occ       <chr> "0930", "1200", "1030", "0030", "0930", "0115",...
## $ area           <chr> "06", "12", "07", "14", "20", "04", "02", "21",...
## $ area_name      <chr> "Hollywood", "77th Street", "Wilshire", "Pacifi...
## $ rpt_dist_no    <chr> "0646", "1259", "0711", "1488", "2033", "0497",...
## $ part_1_2       <chr> "1", "1", "1", "1", "1", "1", "1", "2", "1", "2...
## $ crm_cd         <chr> "210", "510", "510", "510", "236", "310", "510"...
## $ crm_cd_desc    <chr> "ROBBERY", "VEHICLE - STOLEN", "VEHICLE - STOLE...
## $ mocodes        <chr> "0400 0432 0344 0411 2033", NA, NA, NA, "2000 1...
## $ vict_age       <chr> "35", "0", "0", "0", "27", "0", "0", "25", "0",...
## $ vict_sex       <chr> "M", NA, NA, NA, "F", "X", NA, "F", NA, "M", "M...
## $ vict_descent   <chr> "W", NA, NA, NA, "H", "X", NA, "W", NA, "H", "W...
## $ premis_cd      <chr> "102", "104", "707", "108", "108", "405", "101"...
## $ premis_desc    <chr> "SIDEWALK", "DRIVEWAY", "GARAGE/CARPORT", "PARK...
## $ weapon_used_cd <chr> "500", NA, NA, NA, "512", NA, NA, NA, NA, "400"...
## $ weapon_desc    <chr> "UNKNOWN WEAPON/OTHER WEAPON", NA, NA, NA, "MAC...
## $ status         <chr> "IC", "IC", "IC", "IC", "IC", "IC", "IC", "IC",...
## $ status_desc    <chr> "Invest Cont", "Invest Cont", "Invest Cont", "I...
## $ crm_cd_1       <chr> "210", "510", "510", "510", "236", "310", "510"...
## $ location       <chr> "SELMA", "700 E  76TH                         P...
## $ cross_street   <chr> "VINE", NA, NA, NA, "7TH                       ...
## $ lat            <chr> "34.0998", "33.9703", "34.0782", "33.9553", "34...
## $ lon            <chr> "-118.3267", "-118.263", "-118.3732", "-118.385...
## $ crm_cd_2       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "99...
## $ crm_cd_3       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ crm_cd_4       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

Exercise 8

Create a query string that selects only cases where the variable vict_sex equals “F” (female).

base_url = "https://data.lacity.org/resource/2nrs-mtv8.json?$where=vict_sex = 'F'"
lacity_data_females <- read.socrata(base_url, my_token)
glimpse(lacity_data_females)

Let’s make a wordcloud of the area names in the full data set.

library(RColorBrewer) 
library(wordcloud)
pal = brewer.pal(9,"Blues")
street_name <- as.data.frame(table(lacity_data$area_name))
colnames(street_name) <- c("Street_Name", "Count")
wordcloud(street_name$Street_Name, street_name$Count, min.freq = 100, max.words = 15, random.order = T, random.color = T, colors =c("green", "cornflowerblue", "darkred"), scale = c(2,.3))

library(treemap)
## Warning: package 'treemap' was built under R version 3.6.3
treemap_df <-
  lacity_data %>%
  dplyr::filter(str_detect(crm_cd_desc, "CHILD NEGLECT|CHILD ABUSE|INTIMATE PARTNER")) %>% 
  dplyr::group_by(area_name, crm_cd_desc) %>%
  dplyr::summarize(n = n())
treemap(treemap_df, 
        index=c("area_name","crm_cd_desc"), 
        vSize="n", 
        type="index",
        fontsize.labels=c(15,12),
        fontcolor.labels=c("white","orange"),
        fontface.labels=c(2,1), 
        bg.labels=c("transparent"),
        align.labels=list(
          c("center", "center"), 
          c("center", "top")
        ),                                 
        overlap.labels=0.2,                     
        inflate.labels=T
      )

Fake data

treemap_df <-
  fake_data %>%
  dplyr::group_by(ethnic_cd, ldesc) %>%
  dplyr::summarize(n = n())
treemap(treemap_df, 
        index=c("ethnic_cd","ldesc"), 
        vSize="n", 
        type="index",
        fontsize.labels=c(15,12),
        fontcolor.labels=c("white","orange"),
        fontface.labels=c(2,1), 
        bg.labels=c("transparent"),
        align.labels=list(
          c("center", "center"), 
          c("center", "top")
        ),                                 
        overlap.labels=0.2,                     
        inflate.labels=F
      )

Diversion

I want to take a brief diversion to show you how to translate the MO (modus operandi) codes into text, and then introduce the idea of mapping.

ipv_crimes_in_la <- lacity_data %>%
  mutate(vict_age = as.numeric(vict_age)) %>%
  filter(str_detect(crm_cd_desc, "INTIMATE PARTNER"),
         str_detect(premis_desc, "MOTORHOME|GROUP HOME|MOTEL|DWELLING|RESIDENTIAL|HOUSING")) %>%
  mutate(crm_cd_desc = str_replace(crm_cd_desc, ".*INTIMATE PARTNER.*", "IPV"))%>%
  select("dr_no", "crm_cd_desc", "date_occ", "time_occ",  "lat", "lon", "mocodes","area_name",  "vict_age", "vict_sex", "vict_descent", "premis_desc", "weapon_desc", "status_desc")

glimpse(ipv_crimes_in_la)
## Observations: 1,607
## Variables: 14
## $ dr_no        <chr> "200910349", "201812154", "202110215", "201214247...
## $ crm_cd_desc  <chr> "IPV", "IPV", "IPV", "IPV", "IPV", "IPV", "IPV", ...
## $ date_occ     <dttm> 2020-06-01, 2020-06-04, 2020-06-05, 2020-06-06, ...
## $ time_occ     <chr> "0105", "1027", "1700", "2000", "1530", "0230", "...
## $ lat          <chr> "34.1576", "33.9274", "34.2071", "33.9821", "34.2...
## $ lon          <chr> "-118.4074", "-118.2871", "-118.5754", "-118.3318...
## $ mocodes      <chr> "2000 0416 1222", "2000 2002 0329", "2000 1241 04...
## $ area_name    <chr> "Van Nuys", "Southeast", "Topanga", "77th Street"...
## $ vict_age     <dbl> 29, 32, 31, 32, 47, 38, 73, 46, 36, 29, 58, 22, 3...
## $ vict_sex     <chr> "M", "F", "F", "F", "F", "F", "F", "F", "M", "F",...
## $ vict_descent <chr> "H", "B", "H", "B", "B", "B", "O", "B", "H", "H",...
## $ premis_desc  <chr> "MOTEL", "SINGLE FAMILY DWELLING", "MULTI-UNIT DW...
## $ weapon_desc  <chr> "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",...
## $ status_desc  <chr> "Invest Cont", "Invest Cont", "Adult Arrest", "In...
crimebyarea <-  ipv_crimes_in_la  %>%
  dplyr::group_by(crm_cd_desc, area_name) %>%
  dplyr::summarise(count= n())

crimebyarea
## # A tibble: 21 x 3
## # Groups:   crm_cd_desc [1]
##    crm_cd_desc area_name   count
##    <chr>       <chr>       <int>
##  1 IPV         77th Street   172
##  2 IPV         Central        56
##  3 IPV         Devonshire     46
##  4 IPV         Foothill       57
##  5 IPV         Harbor        108
##  6 IPV         Hollenbeck     49
##  7 IPV         Hollywood      54
##  8 IPV         Mission        98
##  9 IPV         N Hollywood    70
## 10 IPV         Newton         74
## # ... with 11 more rows
ipv_crimes_in_la$mo <- ipv_crimes_in_la$mocodes  # keep a copy of the raw codes
library(tidyr)  # separate() comes from tidyr
ipv_crime_mo <- separate(data = ipv_crimes_in_la, col = mocodes, into = 
                c("m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9", "m10"), 
                sep = " ")
## Warning: Expected 10 pieces. Missing pieces filled with `NA` in 1550
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].
#makes all "m" variables numeric at once
ipv_crime_mo[,7:16] <- sapply(ipv_crime_mo[,7:16],as.numeric)
glimpse(ipv_crime_mo)
## Observations: 1,607
## Variables: 24
## $ dr_no        <chr> "200910349", "201812154", "202110215", "201214247...
## $ crm_cd_desc  <chr> "IPV", "IPV", "IPV", "IPV", "IPV", "IPV", "IPV", ...
## $ date_occ     <dttm> 2020-06-01, 2020-06-04, 2020-06-05, 2020-06-06, ...
## $ time_occ     <chr> "0105", "1027", "1700", "2000", "1530", "0230", "...
## $ lat          <chr> "34.1576", "33.9274", "34.2071", "33.9821", "34.2...
## $ lon          <chr> "-118.4074", "-118.2871", "-118.5754", "-118.3318...
## $ m1           <dbl> 2000, 2000, 2000, 2000, 2000, 421, 416, 2000, 124...
## $ m2           <dbl> 416, 2002, 1241, 1814, 1814, 913, 1202, 913, 381,...
## $ m3           <dbl> 1222, 329, 448, 913, 913, 2000, 2000, 1814, 1813,...
## $ m4           <dbl> NA, NA, 2002, 432, 416, 1814, 1241, 448, 2000, 44...
## $ m5           <dbl> NA, NA, 444, 444, NA, 416, 913, 444, 913, 417, 41...
## $ m6           <dbl> NA, NA, NA, 429, NA, NA, NA, 1813, 416, NA, 2002,...
## $ m7           <dbl> NA, NA, NA, 448, NA, NA, NA, 1309, 444, NA, NA, N...
## $ m8           <dbl> NA, NA, NA, 216, NA, NA, NA, NA, 446, NA, NA, NA,...
## $ m9           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 1821, NA, NA, NA,...
## $ m10          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ area_name    <chr> "Van Nuys", "Southeast", "Topanga", "77th Street"...
## $ vict_age     <dbl> 29, 32, 31, 32, 47, 38, 73, 46, 36, 29, 58, 22, 3...
## $ vict_sex     <chr> "M", "F", "F", "F", "F", "F", "F", "F", "M", "F",...
## $ vict_descent <chr> "H", "B", "H", "B", "B", "B", "O", "B", "H", "H",...
## $ premis_desc  <chr> "MOTEL", "SINGLE FAMILY DWELLING", "MULTI-UNIT DW...
## $ weapon_desc  <chr> "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",...
## $ status_desc  <chr> "Invest Cont", "Invest Cont", "Adult Arrest", "In...
## $ mo           <chr> "2000 0416 1222", "2000 2002 0329", "2000 1241 04...
tbl_lookup<-read.csv("D:/Barboza Law/COC presentation/Day 2/MO_CODES_Numerical_20180627.csv")
names(tbl_lookup)[1] <- "id"

# For each MO code column (m1-m10, columns 7-16), look up the numeric code
# in tbl_lookup and append its text description as a new column. The new
# columns are unnamed, so R names them V25, V26, ..., V34 automatically.
for (i in 1:10){
  ipv_crime_mo[,(length(ipv_crime_mo)+1)] = tbl_lookup[match(ipv_crime_mo[,(i+6)], tbl_lookup$id), "descript"] 
}

library(writexl)
## Warning: package 'writexl' was built under R version 3.6.3
write_xlsx(ipv_crime_mo, "ipv_crime_mo.xlsx")

We now have all IPV incidents from June 1, 2020 to the present (whatever that date is; today happens to be July 8, but tomorrow it will be a different date). Let’s illustrate the power of data mining by returning all incidents whose modus operandi codes indicate the partners were in a same-sex relationship.

homosexual_data <- ipv_crime_mo %>% 
  filter_at(.vars = vars(V25, V26, V27, V28, V29, V30, V31, V32, V33, V34), 
            .vars_predicate = any_vars(str_detect(. , "Homo")))

Next, let’s start to explore how to cluster crime across space (and time). You may have noticed that this dataset contains the latitude (Y) and longitude (X) associated with each crime incident. Let’s map the crimes we downloaded and see whether any distinctive spatial patterns emerge.

R has a number of powerful GIS-related functions. We will use several of them to map our data; the required libraries appear below.

library(rgdal)
## Loading required package: sp
## 
## Attaching package: 'sp'
## The following object is masked from 'package:ggraph':
## 
##     geometry
## rgdal: version: 1.4-3, (SVN revision 828)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
##  Path to GDAL shared files: C:/Users/g.barboza/Documents/R/win-library/3.6/rgdal/gdal
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: C:/Users/g.barboza/Documents/R/win-library/3.6/rgdal/proj
##  Linking to sp version: 1.3-1
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(sp)
library(spdep)
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
library(tmap)
## Warning: package 'tmap' was built under R version 3.6.1
library(tigris)
## Warning: package 'tigris' was built under R version 3.6.1
## To enable 
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
## 
## Attaching package: 'tigris'
## The following object is masked from 'package:igraph':
## 
##     blocks
## The following object is masked from 'package:graphics':
## 
##     plot

In order to draw the boundary of our region, we need a shapefile. I have already downloaded a shapefile of the City of LA, which also contains data on COVID-19 cases across the city.

Open shapefile

lacity_bdry <- st_read("D:/Barboza Law/COC presentation/Day 2/shapefiles/covid-19 by neighborhood/nghbrhd_data.shp")%>%
  st_transform(4269)
## Reading layer `nghbrhd_data' from data source `D:\Barboza Law\COC presentation\Day 2\shapefiles\covid-19 by neighborhood\nghbrhd_data.shp' using driver `ESRI Shapefile'
## Simple feature collection with 127 features and 8 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 6359577 ymin: 1714640 xmax: 6514629 ymax: 1945542
## epsg (SRID):    NA
## proj4string:    +proj=lcc +lat_1=34.03333333333333 +lat_2=35.46666666666667 +lat_0=33.5 +lon_0=-118 +x_0=2000000 +y_0=500000.0000000002 +datum=NAD83 +units=us-ft +no_defs
plot(lacity_bdry)

In order to map the crime data onto the city boundary, we first have to turn it into a spatial object, which is easy using the sf library in R. Here are the simple steps:

  • Convert the data into a point shapefile using the spatial coordinates of the data file
  • Make sure the CRS is the same for the point shapefile and boundary file, as follows:
points_sf <- st_as_sf(homosexual_data, coords = c("lon", "lat"))
st_crs(points_sf) <- st_crs(lacity_bdry)
plot(lacity_bdry$geometry)
plot(points_sf$geometry, col = "dark red", lwd = 4, cex = .4, add = TRUE)

Cool. There is clearly a cluster near Rampart.

Exercise #8

In this exercise, you will create a spatial point file of all incidents and overlay it onto the shapefile of LA city. Since there is more data, there is inevitably some cleaning to do. First, we want to keep only valid latitude and longitude values. Look at the data and you will see a number of records with latitude and longitude values of 0. Let’s remove them.

#ipv_crime_mo <- ipv_crime_mo[complete.cases(ipv_crime_mo), ] 
ipv_crime_mo$lon <- as.numeric(as.character(ipv_crime_mo$lon))
ipv_crime_mo$lat <- as.numeric(as.character(ipv_crime_mo$lat))
ipv_crime_mo <- ipv_crime_mo %>% filter(lon<0)

Exercise 9

points_sf <- st_as_sf(ipv_crime_mo, coords = c(XXXX))
st_crs(XXXX) <- st_crs(lacity_bdry)
plot(lacity_bdry$geometry)
plot(XXXX, col = "dark red", lwd = 4, cex = .4, add = TRUE)
points_sf <- st_as_sf(ipv_crime_mo, coords = c("lon", "lat"))
st_crs(points_sf) <- st_crs(lacity_bdry)
plot(lacity_bdry$geometry)
plot(points_sf$geometry, col = "dark red", lwd = 2, cex = .2, add = TRUE)

merged_crimes <- st_join(lacity_bdry,points_sf) %>% 
  dplyr::group_by(COMTY_NAME) %>%                  # group by community name
  dplyr::mutate(ipv_crimes = n()) %>% 
  dplyr::ungroup() %>% 
  dplyr::select(COMTY_NAME, ipv_crimes, geometry, count) %>% 
  dplyr::distinct(COMTY_NAME, ipv_crimes, geometry, count) 
## although coordinates are longitude/latitude, st_intersects assumes that they are planar

tmap basics

tmap is similar to ggplot2 in that each input dataset can be ‘mapped’ in a range of different ways, including location on the map (defined by the data’s geometry), color, and other visual variables. The basic building block is tm_shape() (which defines the input data: raster or vector objects), followed by one or more layer elements such as tm_fill() and tm_dots(). This layering is demonstrated in the chunks below.

The object passed to tm_shape() in this case is merged_crimes, an sf object representing IPV incident counts by neighborhood within the city of LA boundary file. Layers are added to represent merged_crimes visually, with tm_fill() creating shaded areas and tm_borders() drawing the border outlines.

# Add fill layer to nz shape
tm_shape(merged_crimes) +
  tm_fill() 

# Add border layer to nz shape
tm_shape(merged_crimes) +
  tm_borders() 

# Add fill and border layers to nz shape
tm_shape(merged_crimes) +
  tm_fill() +
  tm_borders() 

tm_shape(merged_crimes) + 
  tm_polygons() 

tm_shape(merged_crimes) + 
  tm_fill(col = "ipv_crimes")

tm_shape(merged_crimes) + 
  tm_fill('ipv_crimes') +
  tm_borders('gray72') +
  tm_layout(legend.position = c('LEFT', 'bottom'),
            panel.label.size = 1, 
            panel.label.color = 'black',
            panel.label.bg.color = 'gainsboro', 
            panel.label.height = 1.25,
            main.title = 'IPV Cases in Los Angeles',
            main.title.size = 1.2) 

tm_shape(merged_crimes) +
  tm_polygons("count", title = "COVID-19 count", palette = "GnBu",
              style = "kmeans",
              legend.hist = T) +
  tm_shape(merged_crimes) +
  tm_bubbles("ipv_crimes", title.size = "Intimate Partner Violence", col = "gold")+
  tm_scale_bar(width = 0.22, position = c("left", "bottom")) +
  tm_compass(position = c("right", "bottom"))+
  tm_layout(frame = F, 
            title = "City of Los Angeles", 
            title.size = 2, 
            legend.hist.size = 0.5, 
            legend.outside = T) 

my_interactive_covid_dv_map<-tm_shape(merged_crimes) +
  tm_polygons("count", title = "COVID-19 count", palette = "GnBu",
              style = "kmeans",
              legend.hist = T) +
  tm_shape(merged_crimes) +
  tm_bubbles("ipv_crimes", title.size = "Intimate Partner Violence", col = "gold")+
  tm_scale_bar(width = 0.22, position = c("left", "bottom")) +
  tm_compass(position = c("right", "bottom"))+
  tm_layout(frame = F, 
            title = "City of Los Angeles", 
            title.size = 2, 
            legend.hist.size = 0.5, 
            legend.outside = T) 

tmap_mode("view")
## tmap mode set to interactive viewing
tmap_leaflet(my_interactive_covid_dv_map)
## Compass not supported in view mode.
## Scale bar width set to 100 pixels
## Legend for symbol sizes not available in view mode.
tmap_mode(mode="plot")
## tmap mode set to plotting
tm_shape(merged_crimes) +
  tm_polygons('#f0f0f0f0', border.alpha = 0.2) + 
  tm_shape(points_sf) + 
  tm_dots(size=.3, col="black") 
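Once a map looks right, it can be written to disk. tmap_save() handles both static and interactive output, inferring the format from the file extension; the sketch below uses the map object built earlier (the file names are illustrative).

```r
library(tmap)

# Static image: format inferred from the extension (.png, .pdf, ...)
tmap_save(my_interactive_covid_dv_map, filename = "la_ipv_map.png",
          width = 8, height = 6, units = "in")

# Interactive HTML widget, matching tmap_mode("view")
tmap_save(my_interactive_covid_dv_map, filename = "la_ipv_map.html")
```

The HTML version embeds the leaflet widget, so it can be shared as a standalone file.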

Working with and merging census data

One of the most powerful open data sources is census data. There are a few functions in R that allow you to automatically connect to, download, and merge census data into your analyses. Let’s try that with the next few commands. We have data from Los Angeles, so we will download data for LA county at the tract level and then intersect the shapefiles to return only data for the city of LA!

In order to use these functions, you need a census API key (as above). This is easy to get. Once you have one, paste your own key into the code below in place of the placeholder.
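Rather than pasting the key into every script, tidycensus can store it once in your ~/.Renviron file; the sketch below shows this (the key string is a placeholder, not a real key).

```r
library(tidycensus)

# Store the key in ~/.Renviron so future sessions pick it up automatically
# ("YOUR_API_KEY_HERE" is a placeholder, not a real key)
census_api_key("YOUR_API_KEY_HERE", install = TRUE)

# Reload the environment file in the current session
readRenviron("~/.Renviron")

# Later sessions can confirm the stored key with:
Sys.getenv("CENSUS_API_KEY")
```

With install = TRUE, you only need to run census_api_key() once per machine.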

library(tidycensus)
## Warning: package 'tidycensus' was built under R version 3.6.3
library(tidyverse)
options(tigris_use_cache = TRUE)

census_api_key("YOUR_API_KEY_HERE", overwrite=TRUE)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
readRenviron("~/.Renviron")

la_county <- get_acs(state = "CA", county = "Los Angeles", geography = "tract", 
                  variables = "B10052_002", geometry = TRUE)
## Getting data from the 2014-2018 5-year ACS
head(la_county)
## Simple feature collection with 6 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -118.3224 ymin: 34.23124 xmax: -118.2652 ymax: 34.27895
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##         GEOID                                                 NAME
## 1 06037101110 Census Tract 1011.10, Los Angeles County, California
## 2 06037101122 Census Tract 1011.22, Los Angeles County, California
## 3 06037101210 Census Tract 1012.10, Los Angeles County, California
## 4 06037101220 Census Tract 1012.20, Los Angeles County, California
## 5 06037101300    Census Tract 1013, Los Angeles County, California
## 6 06037101400    Census Tract 1014, Los Angeles County, California
##     variable estimate moe                       geometry
## 1 B10052_002        7  13 MULTIPOLYGON (((-118.3008 3...
## 2 B10052_002       28  25 MULTIPOLYGON (((-118.3032 3...
## 3 B10052_002        0  17 MULTIPOLYGON (((-118.2995 3...
## 4 B10052_002       38  39 MULTIPOLYGON (((-118.2859 3...
## 5 B10052_002        0  12 MULTIPOLYGON (((-118.2782 3...
## 6 B10052_002       10  15 MULTIPOLYGON (((-118.3224 3...

Searching for variables

#v17 <- load_variables(2017, "acs5", cache = TRUE)

#View(v17)
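Rather than scrolling through View(), the variable table can be searched programmatically. The sketch below assumes the v17 object from the commented chunk above and filters the labels for a keyword ("disability" is just an illustrative search term).

```r
library(tidycensus)
library(dplyr)

# Load the 2017 5-year ACS variable table (cached after the first call)
v17 <- load_variables(2017, "acs5", cache = TRUE)

# Keep rows whose label or concept mentions "disability"
v17 %>%
  filter(grepl("disability", label, ignore.case = TRUE) |
         grepl("disability", concept, ignore.case = TRUE))
```

The name column of the result gives the variable codes you can pass to get_acs().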
la_county %>%
  ggplot(aes(fill = estimate)) + 
  geom_sf(color = NA) + 
  coord_sf(crs = 26911) + 
  scale_fill_viridis_c(option = "magma") 

racevars <- c(White = "P005003", 
              Black = "P005004", 
              Asian = "P005006", 
              Hispanic = "P004003")

la_decennial <- get_decennial(geography = "tract", variables = racevars, 
                  state = "CA", county = "Los Angeles County", geometry = TRUE,
                  summary_var = "P001001") 
## Getting data from the 2010 decennial Census
head(la_decennial)
## Simple feature collection with 6 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -118.3224 ymin: 34.23124 xmax: -118.2653 ymax: 34.27895
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 6 x 6
##   GEOID  NAME     variable value summary_value                     geometry
##   <chr>  <chr>    <chr>    <dbl>         <dbl>           <MULTIPOLYGON [°]>
## 1 06037~ Census ~ White     2656          4731 (((-118.2886 34.25591, -118~
## 2 06037~ Census ~ White     2437          3664 (((-118.2773 34.26196, -118~
## 3 06037~ Census ~ White     2890          5990 (((-118.2886 34.24861, -118~
## 4 06037~ Census ~ White     1662          3363 (((-118.278 34.24961, -118.~
## 5 06037~ Census ~ White     3190          4199 (((-118.2773 34.25991, -118~
## 6 06037~ Census ~ White     2649          3903 (((-118.3026 34.25239, -118~
la_decennial %>%
  mutate(pct = 100 * (value / summary_value)) %>%
  ggplot(aes(fill = pct)) +
  facet_wrap(~variable) +
  geom_sf(color = NA) +
  coord_sf(crs = 26911) + 
  scale_fill_viridis_c()

st_crs(la_decennial) <- st_crs(lacity_bdry)
## Warning: st_crs<- : replacing crs does not reproject data; use st_transform
## for that
plot(x_and_y <- st_intersection(la_decennial, lacity_bdry), col = "lightgrey", main="st_intersection")
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
## Warning: plotting the first 10 out of 13 attributes; use max.plot = 13 to
## plot all

x_and_y %>%
  mutate(pct = 100 * (value / summary_value)) %>%
  ggplot(aes(fill = pct)) +
  facet_wrap(~variable) +
  geom_sf(color = NA) +
  coord_sf(crs = 26911) + 
  scale_fill_viridis_c()

tm_shape(x_and_y) +
  tm_polygons("value", title = "race", palette = "GnBu",
              style = "kmeans") + 
  tm_facets(by = "variable", drop.NA.facets = TRUE) +
  tm_shape(points_sf) + tm_dots(size=.1, title.size = "Intimate Partner Violence", col = "gold") 

Final exercise

In this final exercise you will retrieve census data for Denver County, Colorado, and map it.

options(tigris_use_cache = TRUE)

census_api_key("YOUR_API_KEY_HERE", overwrite=TRUE)
readRenviron("~/.Renviron")

denver_county <- get_acs(state = "XXXX", county = "XXXX", geography = "XXXX", 
                  variables = "XXXX", geometry = TRUE)
options(tigris_use_cache = TRUE)

census_api_key("YOUR_API_KEY_HERE", overwrite=TRUE)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
readRenviron("~/.Renviron")

denver_county <- get_acs(state = "CO", county = "Denver County", geography = "tract", 
                  variables = "B10052_002", geometry = TRUE)
## Getting data from the 2014-2018 5-year ACS
tm_shape(denver_county) +
  tm_polygons("estimate", title = "Number of Disabled in Denver Co", palette = "GnBu",
              style = "kmeans",
              legend.hist = T) +
  tm_shape(denver_county) +
  tm_bubbles("moe", title.size = "Margin of Error", col = "gold")+
  tm_scale_bar(width = 0.22, position = c("left", "top")) +
  tm_compass(position = c("right", "bottom"))+
  tm_layout(frame = F, 
            title = "Denver County", 
            title.size = 2, 
            legend.hist.size = 0.5, 
            legend.outside = T)